ABSTRACT

The data flow concept of computation seeks to achieve high performance by allowing
concurrent execution of instructions based on the availability of data. This thesis explores the
translation of a subset of the high level language Val to data flow graphs. The major problem in
performing this translation for the target machine, the Dennis-Misunas data flow computer, stems from
the restriction that graph execution sequences place at most one value on any given arc at any time.
The data/acknowledge arc pair transformation is introduced as a means of implementing this required
operational behavior. Its effect on data flow graph operation is subsequently explored as it relates to
correctness and performance.

Though the arc transformation enables graphs to be executed without the possibility of
deadlock, the resulting overhead and the potential loss of some concurrency represent significant costs.
Two techniques aimed at minimizing these problems are developed for optimizing transformed graphs.
The optimization to eliminate unneeded acknowledge arcs analyzes Val constructs to identify arc pairs
which may permit removal of their acknowledge arc. The optimization to balance token flow specifies a
method of inserting identity operators into a graph for the purpose of pipelining input sets, thereby
increasing graph throughput. Though developed within the context noted, the translation and
optimization issues described should prove applicable to other data flow architectures.
CHAPTER ONE
1.1 Introduction

The short history of computing as a science is unique in its unparalleled rate of technological
growth. In response to this, the demand for greater levels of computing power has risen as rapidly.
Anticipating the continuation of this trend, research in the area of parallel computation seeks to achieve
high performance by manipulating programs to exploit the parallelism inherent in many problems.
Though this has led to the introduction of "do in parallel" constructs within certain languages, the
sequential nature of conventional machine programming has proved to be a barrier to the formulation
of an adequate and practical approach. The data flow concept of computation overcomes this difficulty
by allowing the availability of data to determine the execution sequence, rather than a sequential
instruction counter. In the data flow model, an operation is executed as soon as its required operands
have been computed. The development of this concept has resulted in the proposal of several data flow
machine architectures and associated data flow languages. This thesis addresses certain language
translation problems which arise in translating the high level data flow language Val [2] for the
Dennis-Misunas data flow machine [11].

The concept of data flow is best illustrated by data flow graphs, which explicitly show the data
dependencies of operations in a data flow program. The operators and arcs of data flow graphs are
viewed as an abstraction of the instruction cells and operand registers of the data flow machine and, as
such, provide a model for describing translation problems. The chapter proceeds with a more detailed
look at the components and operation of data flow graphs, followed by a brief look at the high level
data flow language Val and its translation into graph form. The major problem, termed safety, which
arises in making the translation will be identified and discussed in section 1.4. While resolving the
safety issue is straightforward, the solution introduces a secondary, more subtle set of problems to the
graphs. Section 1.5 identifies these along with several optimizations of the initial solution aimed at
minimizing such problems, an expanded discussion of which forms a major portion of this thesis. The
chapter concludes with a synopsis of the remainder of the thesis.

1.2 Data Flow Graph Operation 

The basic components of directed data flow graphs are operators and arcs which join the
operators. When an operator fires, it absorbs values or tokens from its input arcs and produces tokens
on its output arcs. There are three operator types and corresponding rules defining their operation or
firing behavior. The graph in Figure 1.1, which represents the Val expression 'if exp then f else g',
contains instances of each type. The exp node is an abbreviation for a Val expression representing the
predicate of the conditional; thus, it should evaluate to a boolean value.

The most generalized operator type is the functional operator, represented in the figure by
nodes f and g. These operators may perform simple arithmetic operations such as addition or
multiplication, or more complex functions such as square root. The firing behavior rule for functional
operators specifies that a token be present on each input arc for the operator to fire, at which time all
inputs are absorbed, the appropriate function is computed and a result token is produced on each of the
operator's output arcs.

The true and false control gates, represented in Figure 1.1 by the T and F nodes, form a second
operator type. Each of these operators requires a control and a data input to fire, and operates
according to the following rule: If the control input matches the gate type, the data input is transmitted
to the gate's output arc; otherwise the input data token is absorbed and no output is produced. Thus, a
T gate (F gate) will transmit its input data token to its output arc if and only if it receives a true (false)
input control token.

Figure 1.1. Data flow graph of the Val expression 'if exp then f else g'

The remaining operator type is the M gate, or merge control gate, which has three inputs: a
control input, and two data inputs corresponding to true and false control input values. To fire, an M
gate requires an input control token and the corresponding input data token, which is then transmitted to
the gate's output arc. A value present on the input data arc not selected is unaffected by the gate's
firing. Appropriately, the M gate merges two paths in the graph. Thus, Figure 1.1 models the
conditional construct behavior by allowing an input token to flow through either the T or F gate (based
on the evaluation of exp) to the M gate, which merges the true and false paths to produce a result token
on the graph output port.
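The three firing rules can be sketched in Python as follows. This is a minimal illustration, not the machine's instruction set: the names Arc, fire_functional, fire_gate, fire_merge and run_conditional are hypothetical, arcs are modelled as unbounded queues, and the input value is simply copied onto three arcs to stand in for fan-out.

```python
from collections import deque

class Arc:
    """Hypothetical arc model: an unbounded FIFO queue of tokens."""
    def __init__(self):
        self.tokens = deque()

def fire_functional(fn, inputs, outputs):
    """Functional operator: needs a token on every input arc."""
    if not all(a.tokens for a in inputs):
        return False
    args = [a.tokens.popleft() for a in inputs]   # absorb all inputs
    result = fn(*args)
    for out in outputs:                           # one result token per output arc
        out.tokens.append(result)
    return True

def fire_gate(gate_type, control, data, output):
    """T gate (gate_type=True) or F gate (gate_type=False)."""
    if not (control.tokens and data.tokens):
        return False
    c = control.tokens.popleft()
    v = data.tokens.popleft()
    if c == gate_type:
        output.tokens.append(v)   # control matches the gate type: transmit
    return True                   # otherwise the data token is just absorbed

def fire_merge(control, true_in, false_in, output):
    """M gate: needs a control token and a token on the selected data arc."""
    if not control.tokens:
        return False
    selected = true_in if control.tokens[0] else false_in
    if not selected.tokens:
        return False
    control.tokens.popleft()
    output.tokens.append(selected.tokens.popleft())
    return True

def run_conditional(x, pred, f, g):
    """Evaluate 'if pred then f else g' on input x by firing operators."""
    x_exp, x_t, x_f = Arc(), Arc(), Arc()
    c_t, c_f, c_m = Arc(), Arc(), Arc()
    t_out, f_out, m_true, m_false, out = Arc(), Arc(), Arc(), Arc(), Arc()
    for a in (x_exp, x_t, x_f):                   # fan the input out
        a.tokens.append(x)
    steps = [
        lambda: fire_functional(pred, [x_exp], [c_t, c_f, c_m]),
        lambda: fire_gate(True, c_t, x_t, t_out),
        lambda: fire_gate(False, c_f, x_f, f_out),
        lambda: fire_functional(f, [t_out], [m_true]),
        lambda: fire_functional(g, [f_out], [m_false]),
        lambda: fire_merge(c_m, m_true, m_false, out),
    ]
    while any(step() for step in steps):          # fire until nothing is enabled
        pass
    return list(out.tokens)
```

Calling run_conditional with a predicate such as "v > 0" routes the input through only one of the two branches; the token on the other branch is absorbed by its gate, so no stale token is left behind.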
1.3 Translation of Val to Data Flow Graphs

While data flow graphs expose concurrency inherent in a computation by explicit
representation of operator dependencies, it is impractical to express programs in this form. Instead, we
introduce the high level data flow language Val, an acronym for value-oriented algorithmic language, and
a translation algorithm mapping Val programs into data flow graphs. Developed by Ackerman and
Dennis [2] as a source language for data flow graphs, Val is an applicative language containing
constructs well suited for expressing parallelism in a program. A BNF specification of the syntax of a
subset of Val used in the development of this thesis follows.

exp ::= id | const | exp, exp | oper(exp) | let idlist = exp in exp |
        if exp then exp else exp | for idlist = exp do iterbody

iterbody ::= exp | iter exp | let idlist = exp in iterbody |
             if exp then iterbody else iterbody

id ::= "programming language identifiers"

idlist ::= id {, id}

const ::= "programming language constants"

oper ::= "programming language operators"

The recursive translation algorithm mapping Val expressions into their data flow graph
implementations, defined by J. D. Brock [3], consists of the functions T and T_I which respectively map
Val expressions and iteration bodies into their graph implementations. Both functions produce graphs
which have an input port for each free variable in the expression or iteration body being translated.
T[exp] has an output port for each value returned by the expression; T_I[iterbody] has two sets of
output ports, I and R, used respectively to re-iterate or return a set of values, and an output port iter? to
signal which possibility has occurred. Translations of the conditional and iteration expressions are used
extensively in this thesis, and are shown in Figures 1.2 and 1.3 respectively.
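Since each translated graph has one input port per free variable, it can help to see how the free variables of an expression in the Val subset might be computed. The sketch below is an illustration only: the class names (Id, Const, Seq, Oper, Let, If, For, Iter) and the function free_vars are hypothetical, and it assumes that let and for bind their idlist in the body but not in the defining expression.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Id:
    name: str

@dataclass(frozen=True)
class Const:
    value: object

@dataclass(frozen=True)
class Seq:            # exp, exp
    left: object
    right: object

@dataclass(frozen=True)
class Oper:           # oper(exp)
    name: str
    arg: object

@dataclass(frozen=True)
class Let:            # let idlist = bound in body
    ids: tuple
    bound: object
    body: object

@dataclass(frozen=True)
class If:             # if pred then then_ else else_
    pred: object
    then_: object
    else_: object

@dataclass(frozen=True)
class For:            # for idlist = bound do body (body is an iterbody)
    ids: tuple
    bound: object
    body: object

@dataclass(frozen=True)
class Iter:           # iter exp (only valid inside an iterbody)
    exp: object

def free_vars(e):
    """Free variables of an expression or iteration body (assumed scoping)."""
    if isinstance(e, Id):
        return {e.name}
    if isinstance(e, Const):
        return set()
    if isinstance(e, Seq):
        return free_vars(e.left) | free_vars(e.right)
    if isinstance(e, Oper):
        return free_vars(e.arg)
    if isinstance(e, Iter):
        return free_vars(e.exp)
    if isinstance(e, If):
        return free_vars(e.pred) | free_vars(e.then_) | free_vars(e.else_)
    if isinstance(e, (Let, For)):
        # idlist is bound in the body only
        return free_vars(e.bound) | (free_vars(e.body) - set(e.ids))
    raise TypeError(f"not a Val subset node: {e!r}")

# for i, s = 0, 0 do if <(i, n) then iter +(i, 1), +(s, i) else s
loop = For(
    ("i", "s"),
    Seq(Const(0), Const(0)),
    If(
        Oper("<", Seq(Id("i"), Id("n"))),
        Iter(Seq(Oper("+", Seq(Id("i"), Const(1))), Oper("+", Seq(Id("s"), Id("i"))))),
        Id("s"),
    ),
)
```

Under these assumptions the example loop has the single free variable n, so its graph would have one input port.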

Functioning of the conditional expression in Figure 1.2 should be clear from the discussion of
Figure 1.1. Evaluation of T[exp1] should produce an input control value for all gates in the graph,
allowing tokens to flow through either the T or F gates, enabling computation of the graph represented
by T[exp2] or T[exp3] respectively. The iteration expression of Figure 1.3 is formed by using M gates
to merge the values resulting from evaluation of exp with the iteration, I, outputs of T_I[iterbody]. The
control input port of each M gate is connected to the iter? output of T_I[iterbody], initialized with a
false token to ensure that selection of the first set of data values is from T[exp]. A set of data values
will be iterated as long as successive iter? outputs are true and will be returned at the first instance of a
false iter? output, which reinitializes the M gates. A more detailed explanation of the application of the
translation algorithm to the conditional and iteration expressions, as well as to the remaining
expressions specified in the Val subset defined above, can be found in [3].
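The role of the initial false control token can be illustrated with a small simulation. This is a simplified sketch under stated assumptions rather than Brock's translation: run_iteration and sum_below are hypothetical names, the M gates are collapsed into a single selection step, and the iteration body is modelled as a Python function returning the iter? value together with the new value set.

```python
from collections import deque

def run_iteration(input_sets, iterbody):
    """Simulate the M-gate selection of an iteration expression."""
    control = deque([False])       # iter? arc, initialized with a false token
    pending = deque(input_sets)    # value sets produced by T[exp]
    feedback = deque()             # I outputs of the iteration body
    results = []                   # R outputs of the iteration body
    while control:
        use_feedback = control.popleft()
        source = feedback if use_feedback else pending
        if not source:
            break                  # the M gates wait for a new input set
        values = source.popleft()
        iterate, out = iterbody(*values)
        control.append(iterate)    # a false value reinitializes the M gates
        (feedback if iterate else results).append(out)
    return results

def sum_below(i, s, n):
    """Hypothetical iteration body: sum the integers 0 .. n-1."""
    if i < n:
        return True, (i + 1, s + i, n)   # iterate with new values
    return False, (s,)                   # return the result
```

Presenting two input sets in sequence, for example (0, 0, 3) and (0, 0, 5), yields one returned set per input set, because each false iter? value leaves the control arc in its initial state for the next set.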



Figure 1.2. T[if exp1 then exp2 else exp3 end]

Figure 1.3. T[for idlist = exp do iterbody end]

A major concern in generating data flow graph implementations of Val expressions is ensuring
correct modelling of the semantics of each high level construct. In fact, the translation algorithm is part
of a two step process giving the operational semantics for the Val subset: The operational semantics of
a data flow program is a formal modelling of the execution of the program's data flow graph. The
operators composing data flow graphs are determinate, meaning that every complete set of inputs to an
operator (one for each input port) produces a unique set of outputs. Patil [25] proved that if the
operators of a graph are determinate, the graph itself is determinate. Developing operational semantics
for Val is possible due to the determinate nature of its corresponding data flow graphs. Thus, a
complete set of inputs to a data flow graph will produce a unique set of outputs, making it necessary to
examine only one execution sequence of a graph to derive the result of its execution. The graphs in this
thesis are generated from Brock's translation algorithm and are therefore assumed to be correct
semantic representations based on the operational semantics developed in [3].
1.4 Safety Transformations for Data Flow Graphs

Though we accept the data flow graphs generated by the translation algorithm discussed in the
previous section as theoretically correct, their arcs are assumed to be infinite queues; this prevents
their realization. While it might be possible to implement the graphs using sufficiently large finite
buffers, this solution may not be acceptable. To examine the problem, consider the state of the graph
shown in Figure 1.4. The token configuration shown can be reached by assuming that the graph occurs
within an iteration construct which recycles the output of the construct. The second set of inputs shown
could therefore have been generated in response to the output resulting from the first set of inputs.
Assuming that the output of this first set was produced by propagating tokens through the false branch
of the graph, it would be possible for the corresponding T gate inputs (tokens labelled 1) to still be
present when the second set of tokens arrives, creating the computation state shown.



Figure 1.4. Unsafe token configuration resulting from infinite queue arcs 




While an implementation of graph arcs as buffers of some constant size (greater than one)
could accommodate this configuration, the design of a number of data flow architectures, including that
of the Dennis-Misunas data flow machine, cannot support this: The correspondence of graph arcs to
machine registers in such designs makes it necessary to consider only those execution sequences which
place at most one token on any given arc at any time. In the Dennis-Misunas data flow machine, the
consequences of placing more than one token on an arc or, correspondingly, computing a successive
register value before it can be stored, are possible nondeterminism, and deadlock as a result of values
queueing up in its distribution network and blocking other values from reaching their destinations^].
Meeting the one-token operational requirement involves preventing data flow operators from
producing new tokens until their output arcs are empty. This behavior is achieved by defining the
following firing rule for all graph operators:

Operator Firing Rule: An operator is enabled to fire when all of its needed inputs are
present and all of its output arcs are empty.

Application of this rule prevents the Figure 1.4 state from occurring. 
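The operator firing rule can be expressed as a short enabling check. The following is a minimal sketch under the assumption that each arc corresponds to a single register; the names Arc, enabled and fire are hypothetical, and None is reserved to mean "empty", so it cannot be used as a token value here.

```python
class Arc:
    """Hypothetical one-register arc: holds at most one token (None = empty)."""
    def __init__(self, token=None):
        self.token = token

    def is_empty(self):
        return self.token is None

def enabled(inputs, outputs):
    """Operator firing rule: all inputs present and all output arcs empty."""
    return (all(not a.is_empty() for a in inputs)
            and all(a.is_empty() for a in outputs))

def fire(fn, inputs, outputs):
    """Fire a functional operator if the rule allows it; report success."""
    if not enabled(inputs, outputs):
        return False
    args = [a.token for a in inputs]
    for a in inputs:
        a.token = None            # absorb the input tokens
    result = fn(*args)
    for a in outputs:
        a.token = result          # output arcs were empty, so nothing is overwritten
    return True
```

With a full output arc, a second call to fire is refused and the waiting input token stays in place until the output has been consumed, which is the behavior that rules out the Figure 1.4 state.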

While the operator firing rule defines the desired token behavior, the problem of
implementation remains. By performing a transformation which replaces each arc of a data flow graph
by an appropriate data/acknowledge arc pair (d/a arc pair), the graph's infinite queues are replaced by
buffers of capacity one, and the operator firing rule is explicitly built into the graph. This is illustrated
in Figure 1.5, which shows the transformed conditional construct of Figure 1.4. The transformation
creates arc pairs which hold either a data or acknowledge token, where the latter indicates that its
corresponding data arc is empty. With the addition of acknowledge arcs and tokens, firing rules revert
to their original specifications, which depend only on the presence of tokens on input, including
acknowledge, arcs: The operator firing rule requirement that output arcs be empty is ensured by the
enabling condition that acknowledge inputs be present.

Figure 1.5. Transformed Figure 1.4

The keyword used in describing this transformation is safety, where the underlying idea and
the terminology are rooted in Petri net theory. Chapter 2 discusses the analogy between data flow graphs
and Petri nets, and the influence of Petri net theory on the safety transformation. Included in the same
chapter is a more detailed description of the transformation, and a consideration of its effect on the
correctness of graphs.



1.5 Optimizing Transformed Data Flow Graphs



While the transformation of data arcs to d/a arc pairs enables the implementation of data flow
graphs, it is imperative to question the cost of the acknowledging scheme and determine the
inefficiencies, if any, that are introduced. In fact, there is much to say concerning these issues. Aside
from the obvious overhead involved in incorporating acknowledge arcs and tokens, the constraints
which they impose on graph operation may cause bottlenecks. In response to this, we have developed
optimization techniques which focus on decreasing overhead and increasing graph throughput. The
optimization to eliminate unneeded acknowledge arcs is aimed at decreasing overhead, thereby reducing
the cost of the transformation scheme. An analysis of data flow graphs of Val constructs indicates that
the effect of certain acknowledge arcs is realized by the graph's control structure, making the arcs
unnecessary. On the other hand, increasing throughput, the goal of the optimization to balance token
flow, is accomplished by introducing additional identity actors into the graph and consequently creating
more d/a arc pairs.

Note that though the term "optimization" may take on a variety of meanings, our use of the
word is confined to the d/a arc pair transformation described above: Both optimizations consider the
number of acknowledges used in data flow graph translations. We do not consider program dependent
optimizations which might typically involve modification of a graph's structure, i.e., removal of
unnecessary data arcs or operators. This latter form of optimization is analogous to standard
optimization techniques for conventional sequential programs and, though not yet fully explored,
should prove readily adaptable to data flow.

1.6 Structure of Thesis 

Having established a foundation, we proceed to consider the main tasks identified. Chapter 2
expands on the safety transformation introduced in section 1.4, and discusses related relevant theory.
Chapters 3 and 4 respectively contain a development of the optimizations to balance token flow, and
eliminate unneeded acknowledge arcs. Conclusions are presented in chapter 5 along with suggested
areas for future research.
CHAPTER TWO

2.1 The Safety Transformation 

The aim of the data/acknowledge arc pair transformation of data flow programs is to
implement the operator firing behavior, defined in chapter 1, and restated here:

Operator Firing Rule: An operator is enabled to fire when all of its needed inputs are
present and all of its output arcs are empty.

This rule reflects the correspondence of data flow graph arcs to machine registers, which requires that
the occurrence of more than one token on any arc be prevented: Restricting data flow graph behavior
in this manner is necessary to ensure determinate and deadlock free execution for the architecture
assumed. The analogy between the data flow graph characteristics of determinacy and deadlock and
the Petri net theory properties of safety and liveness suggests the use of Petri net theoretical results to
formulate and verify the d/a arc pair transformation. In fact, the strategy taken in developing the safety
transformation is to extract relevant Petri net concepts and redefine them for data flow graphs.

This chapter proceeds with a closer look at the data flow graph - Petri net analogy, particularly
focusing on the possibility of modelling the former with the latter. Section 2.3 expands on the safety
transformation and its effect in guaranteeing determinate (safe) and deadlock free (live) operation.
While showing the existence of the former is straightforward, a significant question concerns whether
or not the restrictions imposed to ensure safety affect liveness.
2.2 The Petri Net - Data Flow Graph Analogy

2.2.1 History and Analogy 

The major contribution of Petri nets is to aid in understanding systems. A closer look at the
components of Petri nets seems an essential first step. As shown in the Figure 2.1 example, a Petri net
is a graph composed of transitions and places with an initial marking determining the number of tokens
(pieces of data) residing on each place. The transitions and places correspond respectively to data flow
graph operators and arcs. A token must reside on each input place to a transition for it to be enabled for
firing, where firing the transition causes a token on each input place to be removed, and one to appear
on each output place. Figures 2.1(a) and (b) respectively show the Petri net token configuration before
and after firing transition t1. The operation of a Petri net is safe if it behaves according to the following
definition:

Definition. For a marking M, a Petri net is safe if for every marking M' that can be
reached by a sequence of firings from M, there is at most one token on any place.

This is precisely the behavior that we would like data flow graphs to satisfy. Note that the Figure 2.1
graph is, in fact, not safe since the sequence of transition firings t1; t4; t1 will place two tokens on place
P*

We briefly survey the evolution of Petri nets to introduce the theoretical results that could
prove applicable to data flow. Petri nets were initially presented by Petri in 1962 [26] and modified by
Holt in 1968 [15]. Extensive study of safety and liveness for Petri nets of the marked graph and state
machine varieties has been done by Holt and Commoner [16]. Each of these classes forms a particular
subset of free choice Petri nets. This work has been extended by Michel Hack [14] to include free
choice Petri nets. Hack introduces production schemas, similar to data flow graphs, and asserts that
every production schema can be represented by a free choice Petri net. A major result known as the
liveness-and-safeness theorem states circumstances under which a free choice net displays these
properties. We explore the possibility of using such a result in producing determinate and deadlock
free data flow graphs. Guaranteeing safety for free choice Petri nets involves ensuring that every place
is part of some directed cycle containing one token. This fact should prove useful in determining if a
data flow graph is safe, or in modifying it to be safe: We seek a modelling of data flow graphs by free
choice Petri nets which allows us to conclude that a data flow graph is safe and live if its corresponding
Petri net is safe.

Figure 2.1. Petri net token configuration before and after transition t1 firing

2.2.2 Modelling Data Flow Graphs with Petri Nets 

The data flow graph firing behavior requirement that no arc ever hold more than one token
forces us to focus on the correspondence of data flow graph arcs to Petri net places. Were the
correspondence of places to arcs 1-1, showing the Petri net model places safe would prove the data flow
graph arcs "safe". Unfortunately, this is not always the case, as is seen in modelling data flow graph
control structures.

Consider the graph of the conditional construct in Figure 2.2. Evaluation of the predicate
results in enabling either the T or F gate, which respectively determines whether the input data value x
will be processed by f1 or f2. A free choice Petri net model of this data flow graph must enable a token
to proceed down one of two paths to reflect the two branches of the conditional and must merge the
paths. A possible model is shown in Figure 2.3. Places and transitions corresponding to particular arcs
and operators in the data flow graph are so designated. In comparing the decision structures of the
Petri net model and data flow graph, note that place aa' in Figure 2.3 represents two arcs in the data
flow graph. Although the mapping between places and arcs is clearly not 1-1, the Petri net decision
structure presented is essential for allowing a token to take one of two paths. Unfortunately, this makes
it more difficult to determine how properties of place aa' correspond to those of arcs a and a'.

A significant difference in the actual control structure is the absence of specific places and
transitions in the model to represent the data flow graph predicate and its output control arcs. Whereas
the decision concerning which branch of the conditional construct will be executed is uniquely
determined by the output of the predicate, the Petri net is nondeterministic, providing a model for all
possible decisions: Though each token arriving at place aa' will cause only one path of the Petri net to
become active, both paths are potential candidates. This situation emphasizes the use of Petri nets as
general models for specific systems - in this case, data flow graphs [22]. To remedy the modelling
problems of the Figure 2.3 Petri net, a more specific model shown in Figure 2.4 is built which attempts
to localize the nondeterminism in an added portion of the Petri net meant to represent the predicate
and control arcs of the data flow graph. The behavior of the Figure 2.4 transitions modelling the data
flow graph T and F gates is consequently deterministic, since firing is now dictated by the portion of the
net labelled "predicate evaluation". A token on place aa' will enable either the T or F transition,
thereby determining its path.

Figure 2.2. Conditional construct data flow graph

Figure 2.3. Petri net model of Figure 2.2 data flow graph conditional construct

Figure 2.4. Petri net model of Figure 2.2

Though this Petri net modelling of the conditional construct more accurately captures the data
flow graph behavior, the portion of the net representing the T and F gates violates the structure
defining the free choice subset of Petri nets: If a transition following a particular place is firable at a
marking M, then all transitions following that place are firable at M. Informally, the definition of a free
choice Petri net states that every arc from a place must be either the unique output of the place or the
unique input to a transition. Thus, the configuration involving place aa' and the T and F transitions in
Figure 2.4 violates the free choice property. Since free choice nets form the largest subset of Petri nets
for which a developed theory of liveness and safety exists, there is no advantage to pursuing this
modelling route. For this reason we change direction, attempting to accomplish our goals more
directly by extracting the relevant concepts of Petri net theory and redefining them for data flow.
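The informal free choice condition lends itself to a mechanical check. The following is a minimal sketch: is_free_choice is a hypothetical helper, a net is described only by the input places of each transition, and the two example nets are simplified stand-ins for the decision structures of Figures 2.3 and 2.4 (the control place names are invented).

```python
def is_free_choice(transition_inputs):
    """Check that every place-to-transition arc is either the unique output
    of its place or the unique input of its transition.

    transition_inputs: dict transition name -> set of input places
    """
    place_outputs = {}
    for t, ins in transition_inputs.items():
        for p in ins:
            place_outputs.setdefault(p, set()).add(t)
    for p, outs in place_outputs.items():
        for t in outs:
            shared_place = len(outs) > 1                  # p feeds several transitions
            shared_transition = len(transition_inputs[t]) > 1   # t has several inputs
            if shared_place and shared_transition:
                return False
    return True

# Simplified decision structure in the style of Figure 2.3:
# place aa' is the only input of both T and F.
model_without_predicate = {"T": {"aa'"}, "F": {"aa'"}}

# Simplified structure in the style of Figure 2.4: T and F each also need a
# control token from the predicate portion (place names are hypothetical).
model_with_predicate = {"T": {"aa'", "ctl_true"}, "F": {"aa'", "ctl_false"}}
```

The first net passes the check, while the second fails because place aa' has two output transitions and each of them has a second input place.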

2.3 The Data/Acknowledge Arc Pair Transformation 

2.3.1 Achieving Safe Data Flow Graph Operation 

Since the Petri net properties of safety and liveness reflect the behavior we want data flow 
graphs to display, we attempt to redefine these terms for data flow via the correspondence of arcs and 
operators to places and transitions. 

Definition. For an initial configuration of tokens, a data flow graph is safe if every
configuration of tokens that can be reached from the initial configuration contains at
most one token on any individual arc.

Definition. An initialized data flow graph is live if a complete set of inputs will 
eventually cause a complete set of values to appear on the output arcs of the graph. 

To ensure safe operation in Petri nets, every transition in the net must be part of a one-token directed
cycle. Adapting this for data flow is accomplished by introducing initialized data/acknowledge arc pairs
(d/a arc pairs) and ensuring that every arc in a data flow graph is part of such a pair.

The mechanics of the transformation illustrated in Figure 2.5 involve replacing each full data
arc with an arc pair composed of a full data arc and empty acknowledge arc, and each empty data arc
with an arc pair composed of an empty data arc and full acknowledge arc. Alternatively, Brock's T
algorithm can be modified to produce graphs with d/a arc pairs, rather than infinite queue arcs. We
distinguish the two by terming such an algorithm T_d/a, as opposed to T_∞. The Figure 2.5 graph
segment labelled "pre-firing state" represents the transformation of the graph segment to its left.
Having defined this transformation we must verify that, in fact, it accomplishes its intended function -
to ensure the safety and liveness of data flow graphs.
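The replacement rule for individual arcs is simple enough to state as a small function. This is an illustrative sketch only: to_da_pairs is a hypothetical name, and each arc is reduced to a flag saying whether it currently holds a token.

```python
def to_da_pairs(data_arcs):
    """Replace each data arc by a data/acknowledge arc pair.

    data_arcs: dict arc name -> True if the arc holds a token, False if empty.
    A full data arc gets an empty acknowledge arc; an empty data arc gets a
    full acknowledge arc, so every pair starts out holding exactly one token.
    """
    return {name: {"data": full, "ack": not full}
            for name, full in data_arcs.items()}

pairs = to_da_pairs({"in": True, "mid": False, "out": False})
```

Each resulting pair holds exactly one token, which is the starting point for the safety argument that follows.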

An initially transformed graph is potentially safe since each of its arc pairs holds only one
token. What must be shown is the preservation of this property during graph operation. In
the Figure 2.5 graph segment, OP1 is the only enabled operator since it is the only operator which has
tokens present on each of its input arcs. Firing OP1 produces the post-firing state shown. The firing
action results in the absorption of a token from each of OP1's input arcs and the production of a token
on each of its output arcs. Consequently, OP1 is disabled, and OP2 becomes the only enabled operator.
More importantly, OP1 cannot be reenabled until it receives both a data and an acknowledge input,
where the appearance of the latter is dependent on firing OP2: Firing OP2 will absorb its input data
token and produce an acknowledge token, input to OP1. Thus, OP1's output data arc must be empty
for it to fire a successive time, producing a new data output. This reasoning shows the firing behavior
dictated by the data/acknowledge arc pair transformation to be safe.

Figure 2.5. D/A arc pair transformation, showing the pre-firing and post-firing states
(legend: data arc, ack. arc, data token, ack. token)
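The argument can be replayed on a two-operator pipeline. The sketch below is a simplified illustration, not the graph segment of Figure 2.5 itself: Pair, enabled and fire are hypothetical names, tokens carry no values, and an operator's acknowledge inputs are taken to be the acknowledge arcs of its output pairs.

```python
class Pair:
    """Hypothetical d/a arc pair holding either a data or an acknowledge token."""
    def __init__(self, full):
        self.data = full          # True if the data arc holds a token
        self.ack = not full       # True if the acknowledge arc holds a token

def enabled(in_pairs, out_pairs):
    """Original firing rule: a token on every input arc, where the inputs
    include the acknowledge arcs of the operator's output pairs."""
    return all(p.data for p in in_pairs) and all(p.ack for p in out_pairs)

def fire(in_pairs, out_pairs):
    if not enabled(in_pairs, out_pairs):
        return False
    for p in in_pairs:
        p.data, p.ack = False, True    # absorb data, emit acknowledge
    for p in out_pairs:
        p.data, p.ack = True, False    # absorb acknowledge, emit data
    return True

# Pipeline: a -> OP1 -> b -> OP2 -> c, with a full and b, c empty.
a, b, c = Pair(True), Pair(False), Pair(False)
trace = []
trace.append((enabled([a], [b]), enabled([b], [c])))   # before any firing
fire([a], [b])                                         # fire OP1
trace.append((enabled([a], [b]), enabled([b], [c])))
a.data, a.ack = True, False                            # environment supplies a new input
trace.append((enabled([a], [b]), enabled([b], [c])))   # OP1 still waits for the acknowledge
fire([b], [c])                                         # fire OP2
trace.append((enabled([a], [b]), enabled([b], [c])))
```

Throughout the run every pair holds exactly one token, and OP1 only becomes enabled again after OP2 has fired and returned the acknowledge.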

2.3.2 Preservation of Liveness

Verifying liveness of data flow graphs under the d/a arc pair transformation is more difficult.
Due to its determinate nature, a result obtained from a T_d/a graph will match that of its corresponding
T_∞ graph: Any T_d/a graph firing sequence is a legal firing sequence in the T_∞ graph. The question
to address is, therefore, whether the firing rule constraint causes some T_d/a graph to deadlock that
would not have done so in its T_∞ version.

The intuitive feeling that T_∞ graphs and their corresponding T_d/a graphs produce the same
results is established via the theorem stated below. Its proof consists of a structural induction on the
size of data flow graph expressions. By asserting an induction hypothesis for expression subgraphs, we
show that the liveness property holds for T_d/a graphs composed of acyclic interconnections of exp
subgraphs, or graphs whose top level is a conditional or iteration expression.

In analyzing the T_d/a iteration expression, we have to make some assumption about the
behavior of its iterbody operator, which represents an iteration subgraph. Recall that the T_I translation
function produces iterative graphs which have one set of input ports and two sets of output ports
through which values can be iterated or returned, as well as a control output port to signal which of the
two occurs. The behavior of the ports of an iterative subgraph within a well-formed live T_∞ graph can
be characterized as follows: When presented with n sets of inputs, the subgraph will produce n iter?
control values, k true (0 ≤ k ≤ n) and n-k false, and correspondingly k sets of I data values and n-k sets
of R data values, for a total of n data output sets. To prove liveness for a T_d/a graph containing an
iterbody operator, we must first show that the port behavior of T_d/a iterative subgraphs is the same as
that displayed by T_∞ iterative subgraphs. This will allow us to assume the desired iterbody port
behavior, an essential step in proving the expression live.

Proving the correct port behavior for T_d/a iterative subgraphs consists of a subproof occurring
within the larger inductive proof. Since the iteration expression contains the only instance of an
iterbody operator, the subproof should naturally appear just prior to proving the T_d/a iterative
expression live. However, to stem confusion, only a statement of the assumed iterbody operator port
behavior will be made. An outline of the subproof follows the inductive proof. Finally, inherent in
this discussion is the assumption that the equivalence of T_∞ and corresponding T_d/a graphs is being
shown for graphs which are well-formed, where this term is defined as follows:



Definition. A well-formed data flow graph is derived from a syntactically correct Val
program using the T_oo translation algorithm.

We proceed with the liveness theorem.



Theorem: A well-formed live data flow graph will remain live under the d/a arc pair 
transformation. 



Stated in operational terms:

Any T_d/a graph corresponding to a well-formed live T_oo graph, when presented with n
complete input sets will either:

(1) have produced n complete output sets and absorbed n acknowledge sets on its
output d/a arc pairs, and emitted n acknowledge sets on its input d/a arc pairs, or

(2) contain some enabled operator.

Proof:

Basis: A data flow graph consisting of a single functional operator will remain live under the d/a arc 
pair transformation. 

Figure 2.6. Initialized data flow graph of a functional operator

An initialized functional operator is shown in Figure 2.6. On receipt of a complete input set,
the operator will be enabled and, when fired, will produce an output token, absorbing the
acknowledge token on its output arc pair, and emit acknowledge tokens on its input arc pairs. Since
the operator's output arc pair is the graph output arc pair, within finite time the output token
will be absorbed and a corresponding acknowledge token supplied, reinitializing the graph. If an
nth set of inputs has been presented to the operator and an nth output has not appeared, then the
acknowledge arcs of the input arc pairs must have seen their nth acknowledges, n-1 of which were
produced by firing the operator. This implies that the state of the output d/a arc pair is one of
the following: the data arc has its n-1st data value and the acknowledge arc is empty but has seen
n-1 acknowledge tokens; or the data arc is empty and the acknowledge arc is holding its nth
acknowledge token. In the first case, within finite time the n-1st data value will be absorbed and
an nth acknowledge token produced, reenabling the operator. In the second case the operator is
enabled.
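
The firing rule used in the basis argument can be made concrete with a small sketch. The Python
below is illustrative only and is not part of the translation algorithm: the names ArcPair,
enabled, fire, supply and absorb are hypothetical, and a d/a arc pair is modeled, as an assumption,
by a record that holds either a data value or an acknowledge token.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class ArcPair:
    """Hypothetical model of a d/a arc pair: it holds a data token or an acknowledge token."""
    data: Optional[Any] = None   # value on the data arc; None means the data arc is empty
    ack: bool = True             # True when the acknowledge arc holds a token


def enabled(inputs: List[ArcPair], output: ArcPair) -> bool:
    """Enabled when every input pair holds data and the output pair holds an acknowledge."""
    return all(p.data is not None for p in inputs) and output.ack


def fire(fn: Callable[..., Any], inputs: List[ArcPair], output: ArcPair) -> None:
    """Absorb the inputs and the output acknowledge; emit one result and input acknowledges."""
    if not enabled(inputs, output):
        raise RuntimeError("operator is not enabled")
    args = [p.data for p in inputs]
    for p in inputs:
        p.data, p.ack = None, True
    output.data, output.ack = fn(*args), False


def supply(pair: ArcPair, value: Any) -> None:
    """Environment step (assumed): absorb the acknowledge on an input pair, place a value."""
    pair.data, pair.ack = value, False


def absorb(pair: ArcPair) -> Any:
    """Environment step (assumed): take the result off the output pair, return an acknowledge."""
    value, pair.data, pair.ack = pair.data, None, True
    return value


if __name__ == "__main__":
    a, b, out = ArcPair(), ArcPair(), ArcPair()
    supply(a, 3)
    supply(b, 4)
    fire(lambda x, y: x + y, [a, b], out)
    supply(a, 5)
    supply(b, 6)
    print(enabled([a, b], out))   # False: the first result has not yet been absorbed
    absorb(out)
    print(enabled([a, b], out))   # True: the returned acknowledge reenables the operator
```

The second input set cannot be processed until the environment absorbs the first result, which
corresponds to the first of the two output arc pair states distinguished in the basis.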

Induction Hypothesis: In response to an nth complete input set, an exp operator (expression subgraph) 
will either: 

(1) have produced an nth complete output set and absorbed an nth acknowledge set 
on its output d/a arc pairs, and emitted an nth acknowledge set on its input d/a 
arc pairs, or 

(2) contain some enabled operator.

Acyclic Interconnection of exp operators

Assume that the Figure 2.7 graph has been presented with an nth set of inputs and that it has
not produced an nth output set. We will show that the graph must contain an enabled operator.

Suppose the graph has produced j output sets where j<n, and the output arc pairs have had
their jth data values absorbed and are holding their j+1st acknowledge tokens. This implies that
exp_2 must have seen at least j input sets. Three possibilities arise.

Figure 2.7. Acyclic interconnection of expression subgraphs

Suppose exp_2 has not yet seen its j+1st input set. Then by the induction hypothesis, since
exp_1 has seen its nth input set and only emitted j output sets where j<n, exp_1 contains an
enabled operator.

Suppose exp_2 has seen part of its j+1st input set. Then by the induction hypothesis, since
exp_1 has seen its nth input set and not yet emitted a complete j+1st output set where j+1 ≤ n,
exp_1 contains an enabled operator.

Suppose exp_2 has seen its j+1st input set. Then since exp_2 has its j+1st set of input
acknowledges available and has not produced a j+1st output set, by the induction hypothesis it
contains some enabled operator.

Conditional Expression 

The conditional expression is shown in Figure 2.8. In its T_oo form, when presented with n
inputs, exp_1 will produce n boolean outputs: k true, where 0 ≤ k ≤ n, and n-k false. In response
to this, the M gates will see a total of n data input sets, k on their true data input arcs and
n-k on their false arcs. These are merged to produce the graph outputs according to the n M gate
control inputs (k true, n-k false), which correspond to the M gate data inputs.

An important consequence of the d/a firing restriction is that once a control input value is
presented to the M gate, a successive control input cannot appear on that control arc (between σ
and the M gate) until the M gate fires to absorb the previous value and emit an acknowledge token.
The implication of this is that σ is prevented from firing a successive time to reenable any gates
in the graph before the output set corresponding to the previous control value has been produced.
This in turn implies that only one input set will be within the branches of the conditional
expression at any time.

Assume the graph has received an nth set of inputs. Assume further that no operator is
enabled within exp_1. By the induction hypothesis exp_1 must have produced an nth output set. The
d/a arc pair between σ and the M gate can be in one of two states. Either the arc pair is holding
its n-1st control value, or it is holding an nth acknowledge token. Assume the arc pair is holding
its n-1st control value. By the functioning of the graph described above, this implies that the
n-1st input set is being processed. Since the graph has received its nth input set, this implies
that the T and F gates must have emitted an n-1st set of acknowledges by firing in response to the
n-1st set of inputs. We can assume as a result that either exp_2 or exp_3 becomes enabled. By the
induction hypothesis, within finite time we will see the n-1st output set on the appropriate exp
output data arcs and an nth set of acknowledges on the exp input arc pairs. This action enables the
M gates which, when fired, will produce an n-1st set of graph outputs and emit acknowledge tokens
along their data and control input arcs. At this point, the arc pair between σ and the M gate is in
its second possible state, holding its nth acknowledge. Note that if σ is fired, which is now
possible, the graph will be in the state it was in when the arc pair between σ and the M gate held
its n-1st control value. Since within finite time the n-1st set of graph outputs will be absorbed
and each graph output will hold an nth acknowledge, we can repeat the above reasoning to show that
an nth set of graph outputs is produced.

Iterative Expression 

We assert the following concerning the port behavior of the iterbody operator: When
presented with an nth complete set of inputs, the subgraph represented by iterbody will either
produce n iter? control values, k true and n-k false, and correspondingly k sets of I data values
and n-k sets of R data values, or will contain some enabled operator.

The iterative data flow graph is shown in Figure 2.9. We can make the following observations
concerning the functioning of the graph in its T_d/a form. Note that firing copy operator L causes
each of the M gates to be presented with the next control input. The implication of this is
twofold: operator L cannot fire until every M gate has fired, absorbing its previous control input
and emitting acknowledge tokens; and the number of input sets processed by each M gate is either
equal to, or one less than, the number of control inputs that have been presented to each M gate.
The operation of an iterative graph is such that a set of input values will be iterated in response
to true iter? outputs until iterbody produces a false iter? output, which signals return of the
values. We consider these two stages of T_d/a graph behavior, iterating values and returning
values, separately. Since the synchronizing effect of copy operator L prevents any interesting
overlapping of graph input sets, it suffices to show that when presented with one complete input
set the graph will produce an output set without deadlocking. We begin with the return case.

Figure 2.9. Iterative data flow graph

Assume iter? produces a false value. By the first implication above, L cannot present the M
gates with this value until each has fired to acknowledge L and produce a data input to iterbody.
Thus iterbody must see a complete set of inputs for the M gates to be reinitialized. The stated
behavior of iterbody dictates that within finite time a complete set of return values will be
produced in correspondence with the false iter?. Thus if the M gates are reinitialized, a set of
outputs is guaranteed without the possibility of deadlocking. The possible ways of a deadlock
occurring are considered in the iterative path argument which follows.



We proceed to show that a deadlock does not occur within the iterative path of the graph by
assuming the opposite and reaching a contradiction, supporting the conclusion that an enabled
operator exists within the graph. Assume that there exists some well-formed live iterative data
flow graph which deadlocks under the d/a arc pair transformation. To see how the deadlock occurs we
apply the same sequence of computation steps to a T_oo graph and its corresponding T_d/a graph,
until we reach a state where there exists some operator which is enabled in the T_oo graph and not
enabled in the T_d/a graph. The cause of deadlock must be that an operator in the T_d/a graph has
its inputs available, but cannot fire due to the presence of a token on its output arc. We attempt
to locate this operator, which must be an M gate or a gate within iterbody. We proceed to consider
each case.

Assume merge operator M_0 is in such a state, and that it has its jth set of iteration inputs
available. The token on its output arc, labelled q, must be used in producing the I iterative input
value of some other M gate, say M_1. Since the T_d/a graph is deadlocked, one of two situations
must exist:

(1) The path taken by token q through iterbody to the I input of gate M_1 is blocked
(every arc is full).

(2) Token q is input to some operator which lacks some input and therefore is not
enabled.

Assume (1). Recall from our preliminary discussion of iterative graph operation that if token q
was produced as a result of the j-1st input set, it will be used to produce the jth I input of some
M gate which, according to the assumption, is blocked. Thus, the token currently residing on the I
input to that M gate must be part of the j-1st input set or some set previous to the j-1st set.
This implies that the M gate has not yet fired j-1 times. But from our knowledge of iterative graph
operation, this is not possible, since firing copy operator L to present each M gate with a jth
control input required the prior firing of each M gate a j-1st time, sending j-1st acknowledges to
L: a contradiction.

Assume (2). Since firing L a jth time is only possible if each M gate has fired j-1 times, it must
be that a complete set of inputs to iterbody is available, contradicting the assumption that some
input is not present.






Assume the disabled operator occurs within iterbody and that its output arc is an I output
arc. If the disabled operator has a jth set of inputs available, then they will be used to produce
the j+1st I input of some M gate. The token on its output arc must therefore be a jth I input of
that M gate. By the twofold implication stated above, the fact that the disabled operator has its
jth inputs available implies that every M gate was presented with a jth control input and has fired
either j or j-1 times. Thus the M gate which has its jth I input available must have fired j-1
times. If we can show that this M gate is enabled, then within finite time it will fire, sending an
acknowledge to the blocked operator. Consequently, in finite time there will be an enabled operator
within iterbody.

We know that the M gate has its inputs available, so it can only be disabled if its output arc is
full. Assuming this situation, the token on its output arc must be from the j-1st input set and
will be used to produce the jth input of some other M gate. But then we know that within finite
time the operator to which this token is input will fire, since by the twofold implication every M
gate has fired j-1 times. This simultaneously ensures that the operator has its inputs available
and has an empty output arc. The acknowledge necessary to enable the M gate will be sent as a
result of firing the operator. Thus, within finite time, the M gate and subsequently the blocked
operator in iterbody will be enabled.

It follows that if the T_oo graph is well-formed and live, the corresponding T_d/a graph is
well-formed and live. Q.E.D.

The subproof concerning port behavior for iterative subgraphs is also inductive in that it must 
assume a behavior for iterative operators within subgraphs and then prove the behavior for the top 
level structures defining iterative subgraphs. The behavior to be shown has been stated above at the 
start of the section of the proof dealing with the iterative expression. 






The simplest iterative structures, exp and iter exp, are shown in Figure 2.10. Since the iterative
subgraph proof is within the inductive proof above, the induction hypothesis concerning exp
subgraphs is valid. As a consequence, proving that the Figure 2.10 graphs satisfy the stated
behavior is trivial. Establishing this fact for the conditional iteration body, if exp then
iteration_1 else iteration_2, is tedious and will not be presented.

Having developed the data/acknowledge arc pair transformation and shown T_oo and T_d/a
graphs equivalent, the task of determining the quality of this solution remains. Major concerns to
investigate focus on cost and efficiency. Chapters 3 and 4 address these issues and present
optimizations of the solution subsequently developed. Example graphs in the remainder of this
thesis are assumed to have been produced by algorithm T_d/a. Therefore, though not explicitly
shown, all arcs represent d/a arc pairs unless otherwise stated.



Figure 2.10.

CHAPTER THREE
3.1 Balancing Token Flow 

The optimization to balance token flow discussed in this chapter addresses certain
inefficiencies introduced by the acknowledging scheme presented in chapter 2. Though the d/a arc
pair transformation prevents the occurrence of more than one token on an arc at any time, the
firing restrictions it imposes are severe, and may significantly curtail concurrency. Specifically,
the requirement that an operator receive acknowledge signals on each of its output ports before
refiring unnecessarily delays computation of successive input sets. While ensuring the safe
operation of the graph is essential, it is possible to identify which output arcs are potential
bottlenecks, and modify each so that it can be safely implemented as a fixed size buffer. The
purpose of this change is to effectively enable arcs to hold more than one token, thereby
eliminating bottlenecks by allowing computation of successive sets of inputs to "pipeline" through
the graph. Safe implementation of these buffers involves the use of identity operators which, when
inserted along an arc, act as place holders. Identifying arcs within a graph that may cause
bottlenecks, and determining the extent to which they should be buffered, are prerequisites to
their modification. While the former of these tasks is straightforward, deciding on a buffering
strategy is subject to a number of considerations including graph configuration and cost of
buffering.

A simple example is presented in section 3.2 which clearly illustrates the problem addressed in
this chapter, and serves to motivate the subsequent optimization. This discussion is formalized in
an algorithm which produces optimized graphs. The section concludes by pointing out certain
subtleties of graph operation and factors not accounted for in formulating the proposed solution.
In response to this, section 3.3 introduces a modified version of the section 3.2 algorithm, along
with several comparative studies of graphs in their limited and fully buffered configurations.

3.2 Formulating the Optimization

3.2.1 Identifying the Source of Bottleneck 

The goal of the optimization to balance token flow through a graph is to increase throughput
by modifying a graph to allow for maximum pipelining. The bottleneck problem, and therefore
application of the optimization, arises in acyclic segments of a data flow graph. A clear
illustration of the problem is shown in Figure 3.1, the graph translation of the Val expression:

if f = 1 then f1 else f2



Figure 3.1. Buffering for a conditional expression

The interesting and problematic issues arise when considering the consequence of presenting
the graph with multiple input sets. Hopefully, processing of a second set of inputs can begin before
outputs of the previous set appear, with the optimum situation being one in which sets of inputs
pipeline through the graph. Unfortunately, the control structure of the graph dictates that the
overlap in processing of successive sets of inputs be minimal: only one set of values may be within
the branches of the outer conditional at any time. Referring to Figure 3.1, we see that in order for
a second set of values to enter the branches of the conditional, both σ and β must fire a second
time, presenting the sets of T and F gates with new control inputs. However, σ cannot fire a second
time until the M gate to which it also sends a control input has fired to emit an acknowledge. Thus,
the d/a arc connecting σ and the M gate (marked in Figure 3.1 by slashes) prevents sets of values
from pipelining through the graph, creating a bottleneck whose severity depends on the depth of the
computation performed within the branches of the conditional.

Eliminating this undesirable behavior so that successive sets of values may pipeline through the
graph involves finding a method of enabling node σ sooner, consequently allowing the slashed arc to
hold more than one token. The ideal situation would be one in which the arc could hold as many
tokens as the number of sets of values that could be pipelined through the graph.

3.2.2 Preview of a Solution 

Introducing identity operators into the graph provides a means of realizing the desired
behavior. Specifically, inserting identity operators along the slashed arc (Figure 3.1) would break
it into d/a arc pair segments, allowing node σ to fire several times before forcing the M gate to
fire. Using this technique on Figure 3.1 to attain maximum pipelining is accomplished by replacing
the slashed arc with the arc segment shown to its immediate left. As a consequence of this change,
the state shown in Figure 3.2, in which three sets of tokens are pipelining through the graph, can
be reached. (The token sets have been numbered accordingly for clarity.) Thus the introduction of
identity nodes has eliminated the bottleneck. Generalizing this optimization technique requires a
determination of the ideal number and location of buffers to be inserted. To respond to such
considerations, we attempt to analyze how tokens flow through the graph.

3.2.3 Analyzing Token Flow to Characterize the Solution 



Figure 3.2. Token configuration allowed by buffering

Though the data flow computer operates asynchronously and data flow programs
nonsequentially, we can model optimum token flow through the graph by assuming a somewhat
synchronous behavior. To do this, we analyze the firings within the graph in terms of time units,
where during any given unit of time all enabled actors must fire and produce a result. This
assumption attempts to approximate optimum behavior by preventing an enabled actor from remaining
enabled and thereby slowing up processing for any length of time. Recalling that our aim is to
pipeline computation through the graph, we wish to develop a method of modifying the graph so that
under this "synchronous behavior" assumption it displays maximum pipelining and, consequently, best
throughput.
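
The synchronous behavior assumption can be explored with a small simulation. The sketch below is
illustrative only: the function name simulate, the arc numbering, and the example graph (an operator
A feeding a join J both through a two-operator branch B1, B2 and through a direct bypass arc) are
hypothetical and are not taken from the figures. It further assumes that the environment supplies a
new input set whenever a graph input arc pair holds an acknowledge, and absorbs each output one time
unit after it appears.

```python
def simulate(nodes, graph_inputs, graph_outputs, n_arcs, time_units):
    """Step a d/a graph through synchronous time units; return the units in which outputs appear.

    nodes maps an operator name to (input arc ids, output arc ids). Each arc pair is either
    holding a data token (True) or holding an acknowledge token (False).
    """
    has_data = [False] * n_arcs
    for a in graph_inputs:
        has_data[a] = True                 # the first input set is waiting at time 0
    output_times = []
    for t in range(1, time_units + 1):
        before = has_data[:]               # every decision uses the state at the start of the unit
        for a in graph_inputs:             # environment supplies the next input set
            if not before[a]:
                has_data[a] = True
        for a in graph_outputs:            # environment absorbs a result and acknowledges it
            if before[a]:
                has_data[a] = False
        for ins, outs in nodes.values():   # all enabled operators fire during this time unit
            if all(before[a] for a in ins) and not any(before[a] for a in outs):
                for a in ins:
                    has_data[a] = False
                for a in outs:
                    has_data[a] = True
                if any(a in graph_outputs for a in outs):
                    output_times.append(t)
    return output_times


# Arc 0 is the graph input, arc 5 the graph output, arc 4 the unbuffered bypass arc from A to J.
unbuffered = {"A": ([0], [1, 4]), "B1": ([1], [2]), "B2": ([2], [3]), "J": ([3, 4], [5])}

# The same graph with two identity operators I1, I2 splitting the bypass into arcs 4, 6 and 7.
buffered = {"A": ([0], [1, 4]), "B1": ([1], [2]), "B2": ([2], [3]),
            "I1": ([4], [6]), "I2": ([6], [7]), "J": ([3, 7], [5])}

if __name__ == "__main__":
    print(simulate(unbuffered, [0], [5], 6, 12))   # [4, 8, 12]
    print(simulate(buffered, [0], [5], 8, 12))     # [4, 6, 8, 10, 12]
```

Under these assumptions the unbuffered bypass arc holds A back until J has fired, so an output
appears only every fourth time unit, whereas equalizing the two paths with identity operators lets
an output appear every second time unit.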

Referring back to Figure 3.1, we note that every input set to the graph results in the production
of a token on the control (slashed) arc, and tokens that will either be processed by f1 or f2. While
under the synchronous behavior assumption the tokens being processed by these functional operators
can move one step through the graph during every time unit, the control token on the slashed arc
cannot, and must remain stationary until its corresponding tokens propagate through the graph to
enable the M gate. As previously seen, the inability of the control arc to accept a second token
prevents any tokens in a successive input set from being pipelined. The dependency between the
control arc and the branches of the conditional, and the consequent need to equalize their buffering
capacities to attain maximum pipelining, has been recognized by the addition of identity nodes shown
in Figure 3.2. An algorithm to equalize buffering along graph paths must be able to identify
dependencies within a graph and pipeline their paths. This can be accomplished by an arc numbering
scheme which compares and equalizes buffering capacities of dependent paths, recognized by
identifying functional operators or gates which join two or more paths. An illustration of the
algorithm which performs this optimization follows its presentation.

Algorithm to Maximize Pipelining - I

Starting from each graph input, descend through the graph assigning consecutive
numbers to arcs joining successive sets of operators until a multi-input operator is
encountered. Compare the arc numbers on the input arcs of the operator and:

(a) if equal, continue the arc numbering process

(b) if not equal, balance the arcs by inserting identity operators into
the lower numbered arcs. Renumber the modified arcs and
continue the arc numbering process.



Note that if the operator is an M gate, the comparison and balancing described above must involve all 
three input arcs, using the highest numbered arc as the goal. 
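
The arc numbering scheme of Algorithm I can be sketched in a few lines of Python. This is an
illustrative reading of the algorithm rather than a definitive implementation: the function name
balance_token_flow and the representation of a graph as a list of (source, destination) arcs are
hypothetical, graph inputs are modeled as nodes with no incoming arcs, and the graph is assumed to
be acyclic. The sketch relies on the standard library module graphlib (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def balance_token_flow(arcs):
    """Return, for each arc, the number of identity operators to insert on it.

    arcs is a list of (source, destination) pairs describing an acyclic graph.
    """
    preds, incoming = {}, {}
    for i, (src, dst) in enumerate(arcs):
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
        incoming.setdefault(dst, []).append(i)

    level = {}                     # number shared by all balanced input arcs of a node
    padding = [0] * len(arcs)
    for node in TopologicalSorter(preds).static_order():
        ins = incoming.get(node, [])
        if not ins:
            level[node] = 0        # a graph input: its outgoing arcs are numbered 1
            continue
        numbers = [level[arcs[i][0]] + 1 for i in ins]
        goal = max(numbers)        # the highest numbered input arc is the goal
        for i, n in zip(ins, numbers):
            padding[i] = goal - n  # each identity operator raises an arc number by one
        level[node] = goal
    return padding


if __name__ == "__main__":
    # A three-arc path x -> a -> b -> m joined at m by a direct arc from x.
    arcs = [("x", "a"), ("a", "b"), ("b", "m"), ("x", "m")]
    print(balance_token_flow(arcs))   # [0, 0, 0, 2]
```

Because every input arc of an operator takes part in the comparison, an M gate modeled with two
data arcs and one control arc is balanced across all three, as the note above requires.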

The result of applying this algorithm to the graph translation of the following program segment 
is shown in Figure 3.3: 

if f = 1 then if s = 1 then x*(y+1) else x*(y-1) end else x*y end

For reference purposes, the added identity nodes have been numbered. The seven numbers shown at
the extreme left of the graph result from the arc numbering process, and apply respectively to
appropriate arcs moving horizontally across the graph. Nodes I1 and I2 have been added in response
to the imbalances which occur when comparing arc numbers on the input arcs to the multiplication
operators. I3 through I5 are added in response to the comparison of the input arcs to the inner M
gate. Note that, as specified in the algorithm, arc number comparisons involve all three M gate
input arcs. Finally, operators I6 through I15 are introduced as a result of comparing input arcs to
the outer M gate.

One essential question to ask is whether or not the addition of identity operators changes the
functionality of a data flow graph. This can be answered by recognizing that the essence of the
change resulting from the application of Algorithm I is to replace some of the one-token arcs of a
graph with queues of a given finite length. Since successive identity operators along the arc are
separated by d/a arc pairs, the graph remains deterministic; and since an identity actor merely
passes its input to its output arc, the functionality of the graph is unaffected. These observations
ensure the functional equivalence of an optimized graph.

Figure 3.3. Example of maximal pipelining

3.2.4 Observations

In developing this example, there are several interesting observations to make concerning the
optimization and the specified algorithm. As stated above, the optimization is accomplished by first
identifying and then pipelining dependent paths in the graph. While dependencies detected at
functional operators and T and F gates can be handled as described, those resulting from M gates
hold some hidden considerations. Recall from the algorithm that M gate comparisons must involve the
two data arcs and the control arc. The algorithm modifies the graph to achieve maximum pipelining by
equalizing buffering capacities of the paths through the graph to the control arc and two data arcs.
However, while the M gate signals the dependency of each branch of the conditional operating in
conjunction with the control arc, the branches themselves are independent. Thus, while each branch
must pipeline with the control path, they need not necessarily pipeline with each other. If the two
conditional paths are of different lengths, the buffering choices available are to equalize the
control path with either the shorter or the longer conditional branch, or to equalize all three. The
last of these, implemented by the algorithm above, achieves best throughput but has the disadvantage
of causing the insertion of additional identity operators in the shorter conditional branch. Thus,
maximum pipelining may be achieved at the expense of including a number of unnecessary identity
operations. The other two choices recognize the independence of the two conditional paths and avoid
excess buffering, but possibly at the cost of reduced throughput.

A factor not yet considered which interacts with this pipelining choice is the token distribution
effect on the graph of a particular succession of input sets. In Figure 3.3 each input set can take
any of three paths corresponding to the three possible states of f and s. This makes it unlikely
that any one of the three paths will be filled with tokens, more likely that the control arc to the
inner M gate will be filled, and certain that a continuing succession of input sets will fill the
control arc to the outer M gate. If we consider a pattern of input sets such that no one of the
three paths is taken twice in a row, identity nodes I1 and I2 would be unnecessary and could be
removed without decreasing the throughput. In fact, many of the identity nodes could be removed with
no effect. Certainly, the frequency with which graph paths are taken is an important factor in
choosing a buffering strategy. An illustration of this point will be seen in the examples in section
3.3.2.

In identifying some tradeoffs and options to consider in maximally pipelining data flow graphs,
it has become unclear whether or not this approach is always optimal. Perhaps the advantages of a
less pipelined graph are worth a decrease in throughput. Some key issues influencing such a decision
might include cost of identity operations, processor utilization, token flow patterns, and width and
depth of program. Though complete consideration of these would require knowledge of the machine and
particular application, we attempt to illustrate the type of analysis that might be useful and
necessary in making the choice.

3.3 Full vs. Limited Buffering 

3.3.1 Achieving Limited Buffering 

Having questioned whether fully balancing a graph is always necessary or optimal, we proceed
by comparing several graphs in both their limited and fully buffered versions to uncover the
tradeoff issues. A discussion of limited buffering, including how it can be achieved and to what
extent T_d/a graphs display it, is a necessary preliminary.

The difference between full and limited buffering in a data flow graph is seen in the time delay
between successive firings of its operators. In a fully buffered graph, assuming synchronous
behavior, the time delay between repeated firings of any particular operator should be one unit: an
operator which fires at time one should receive acknowledges from its successive operators during
time unit two, reenabling it to fire during time unit three. In a graph displaying limited
buffering, the delay between an operator's firing and receiving appropriate acknowledge signals may
be several time units, thereby slowing repeated firings of the particular operator as well as all
successive operators.

Presently, the T_d/a translation algorithm produces data flow graphs in which every data arc is
paired with an acknowledge arc. We could, however, have considered an algorithm which caused
acknowledge arcs to span two data arcs by having each acknowledge arc link alternate rather than
successive operators. The consequence of such a scheme would be a delay in the sending of
acknowledge signals and hence a graph displaying limited buffering. While section 3.3.2 discusses an
example data flow graph so configured, this approach is undesirable since it requires a significant
modification to the present translation algorithm. The necessity for such an action is also
unjustified since in most cases T_d/a graphs already display limited buffering, as did the Figure
3.3 graph which was modified to achieve full pipelining via Algorithm I. A slight revision of this
algorithm will allow us to produce data flow graphs which display limited buffering to some
predefined degree. For example, it is possible to specify that the delay in sending acknowledge
signals be no greater than two time units. The algorithm shown below produces graphs meeting this
requirement. While the purpose of Algorithm I was to equalize buffering of dependent paths within a
graph, the modification to the algorithm ensures that dependent path lengths are within a specified
bound. By allowing a graph to be easily reconfigured to display different degrees of pipelining, the
algorithm provides a feasible and practical control method of studying varying levels of buffering
in a graph. The modified algorithm is presented below as Algorithm II:

Algorithm to Limit Pipelining - II

Starting from each graph input, descend through the graph assigning consecutive 
numbers to arcs joining successive sets of operators until a multi-input operator is 
encountered. Compare the arc numbers on the input arcs of the operator and:

(a) if the difference is less than or equal to 2, continue the arc 
numbering process 

(b) if the difference is greater than 2, insert identity operators into 
the lower numbered arcs to reduce the difference to 2. 
Renumber the modified arcs and continue the arc numbering 
process. 
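
As an illustration, Algorithm II can be sketched as a bounded variant of the arc numbering pass. The
sketch makes several assumptions that should be read as hypothetical: the function name
limit_path_difference, the (source, destination) arc list, graph inputs modeled as nodes with no
incoming arcs, and, since the text does not say which number is carried forward after a bounded
comparison, the choice to continue numbering from the highest numbered input arc. It relies on the
standard library module graphlib (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def limit_path_difference(arcs, bound=2):
    """Return, for each arc, how many identity operators bring it within `bound` of the goal."""
    preds, incoming = {}, {}
    for i, (src, dst) in enumerate(arcs):
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
        incoming.setdefault(dst, []).append(i)

    level = {}
    padding = [0] * len(arcs)
    for node in TopologicalSorter(preds).static_order():
        ins = incoming.get(node, [])
        numbers = [level[arcs[i][0]] + 1 for i in ins]
        goal = max(numbers, default=0)     # graph inputs have no input arcs and get level 0
        for i, n in zip(ins, numbers):
            # Pad only as far as needed to reduce the difference to the bound.
            padding[i] = max(0, goal - n - bound)
        level[node] = goal                 # assumption: numbering continues from the goal
    return padding


if __name__ == "__main__":
    # A four-arc path x -> a -> b -> c -> m joined at m by a direct arc from x.
    arcs = [("x", "a"), ("a", "b"), ("b", "c"), ("c", "m"), ("x", "m")]
    print(limit_path_difference(arcs))            # [0, 0, 0, 0, 1]
    print(limit_path_difference(arcs, bound=0))   # [0, 0, 0, 0, 3]
```

With a bound of zero the sketch reduces to the equalizing behavior of Algorithm I, which makes it
convenient for comparing degrees of buffering on the same graph.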



An application of Algorithm II appears in section 3.3.2 where it is applied to the Figure 3.3 
graph. We are now prepared to proceed with several graph comparisons of full and limited buffering. 

3.3.2 Examples of Full vs. Limited Buffering 

This section presents two data flow graphs, in both their fully and partially buffered versions. 
The first example achieves limited pipelining by relinking acknowledge arcs between alternate actors as 
described in section 3.3.1 above, while the second example is modified for limited pipelining via 
Algorithm II. Our aim in each case is to compare the functioning of each example's graph
configurations with respect to throughput, acknowledgement overhead, and overall concurrency. The
following assumptions are made concerning the graphs' operation:



(1) Graph firings occur according to the "synchronous behavior" pattern described in 
section 3.2.3 

(2) All graphs are produced by T_d/a with data/acknowledge arc pairs used
throughout
We begin with a simple example in an effort to establish some analysis guidelines. The
program segment shown in Figure 3.4 is a composition of binary operators which, if produced by T_d/a,
should display full pipelining. Thus, there is no need to apply either algorithm to this program
segment. Rather, studying this graph in limited pipelined form will require its restructuring so that
acknowledge arcs link alternate operators. The flow of tokens through the graph for multiple input sets
can be followed using Table 3.1. (For convenience, the operators in the graph have been numbered.)
The initial state of the graph, given in Table 3.1 at time 0, shows inputs (IN) available to OP1 and OP2,
and acknowledges (A) present on all other arc pairs. Progressing through the table along the time axis,
we see that at time 1, OP1 and OP2 fire and acknowledge (F/A), making inputs available to OP3, and
producing acknowledges on their input arc pairs. During time unit 2, OP3 fires, sending a result token
to OP4, which consequently becomes enabled, and acknowledge tokens to OP1 and OP2. At the same
time, a new set of inputs can appear on the input arcs to OP1 and OP2 so that they become reenabled.
In time unit 3, OP1, OP2 and OP4 fire, sending appropriate data and acknowledge tokens which enable
OP3 and OP5. These then fire in time unit 4, enabling OP4 as well as OP1 and OP2 which, as in time
unit 2, concurrently receive a new set of inputs. This time unit is significant since during it, the output
resulting from the first input set is produced. Following through the next few time units shows that due
to the acknowledging scheme, the best throughput possible for a fully pipelined graph is an output
every second time unit: outputs resulting from the second and third input sets appear in time units 6
and 8 respectively.

An examination of the table shows that once the "pipe is full" (time unit 3), the operator
firings of the graph can be grouped into two alternating sets, and consequently, the graph's operation is
characterized by two alternating states. SET1 consists of OP1, OP2 and OP4 firings, or those of the first
and third levels of the graph shown in Figure 3.4. SET2 consists of OP3 and OP5 firings which
compose the second and fourth levels of the graph. Using the fact that alternating levels of the graph
fire concurrently, we see that the minimum number of concurrent operations (assuming a full pipe) is
the number of levels divided by 2. The maximum number is found by computing the sum of the widths
of each firable level for each of the two sets to determine the larger. For the Figure 3.4 graph, SET1
and SET2 consist of three and two concurrent operations respectively. This information should prove
useful in analyzing processor utilization.

Figure 3.4. Maximum pipelining in a simple data flow graph

Table 3.1. Flow of tokens for Figure 3.4
(Legend: IN = inputs present; F/A = fire and acknowledge; horizontal axis: time)

Having gathered these statistics, we proceed by considering Figure 3.5 which shows the same
graph, but in its limited pipelined configuration. Specifically, acknowledge arcs link alternate rather
than successive actors. Comparisons to the Figure 3.4 graph can be made by analyzing the information
contained in Table 3.2, which follows the flow of tokens through this graph. The initial configuration
of the graph, specified in Table 3.2 at time 0, shows inputs present on OP1 and OP2 input arcs, and
acknowledges available to OP3 and OP5. During time unit 1, OP1 and OP2 fire to enable OP3.
Note however, that the OP1 and OP2 input arcs are not acknowledged at this time as they were in the
Figure 3.4 configuration. Acknowledgement of OP1 and OP2 is now dependent on OP3's firing, which
occurs during time unit 2, delaying the arrival of a new set of inputs until time unit 3. Firing of OP4,
which also occurs during time unit 3, enables OP5, which fires to produce an output at time 4. Again,
reenabling of OP3 has been delayed to time unit 4, when it receives an acknowledge from OP5 and
inputs as a result of OP1 and OP2 firing. Time unit 4 is significant in that an output is produced.
However, following the operation of the graph for three input sets shows that the delay in
acknowledging operators has reduced the throughput to an output every third time unit. The second
and third input sets produce outputs in time units 7 and 10 respectively.
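
The output times quoted for the two configurations can be reproduced with a small simulation. The sketch below is a simplified model and rests on several assumptions that are not part of the thesis: each graph level is collapsed into a single stage, a hypothetical stage 0 stands for the environment supplying input sets, the graph output is absorbed immediately, and the name `output_times` and its parameters are illustrative. An `ack_span` of 1 models acknowledge arcs between successive levels, and an `ack_span` of 2 models acknowledge arcs linking alternate levels.

```python
def output_times(levels, ack_span, n_outputs):
    """Return the time units at which the last level of the graph fires.

    fired[i] counts the firings of stage i (stage 0 is the environment).
    A stage is enabled when its predecessor has fired more often than it
    has (a data token is waiting) and the stage ack_span levels ahead has
    fired as often as it has (its previous output has been acknowledged).
    Stages whose acknowledging stage lies beyond the graph are treated as
    always acknowledged.
    """
    fired = [0] * (levels + 1)
    outputs, time = [], 0
    while len(outputs) < n_outputs:
        enabled = []
        for i in range(levels + 1):
            has_data = i == 0 or fired[i - 1] > fired[i]
            acker = i + ack_span
            has_ack = acker > levels or fired[acker] >= fired[i]
            if has_data and has_ack:
                enabled.append(i)
        for i in enabled:  # synchronous behavior: all enabled stages fire
            fired[i] += 1
        if levels in enabled:
            outputs.append(time)
        time += 1
    return outputs


full = output_times(levels=4, ack_span=1, n_outputs=3)     # [4, 6, 8]
limited = output_times(levels=4, ack_span=2, n_outputs=3)  # [4, 7, 10]
```

Under these assumptions the four-level graph yields an output every second time unit when fully pipelined and every third time unit when acknowledges link alternate levels, in agreement with the figures given in the text.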

Figure 3.5. Limited pipelining in a simple data flow graph

Analyzing the operation of the graph using Table 3.2, we see that the acknowledging scheme
allows every third level in the graph to fire concurrently, thereby partitioning the graph into three
interleaving sets of operators. Referring to Figure 3.5, levels 1 and 4 fire together, as would levels 2 and
5, and levels 3 and 6, were the graph to be extended. Corresponding respectively to these three groups
are three states, shown in Table 3.2. Were the graph to be presented with continuous sets of inputs, its
operation would rotate among these three states. For this graph, the numbers of concurrent operations
per state, beginning with state 1, are three, one, and one (determined by computing the sum of the
widths of each firable level for each of the states). Using the "concurrent operations per state" statistic
shows that the Figure 3.4 graph alternates between processing three and two operations, while the
Figure 3.5 graph processes three operations every third time unit and only one during each of the
intermediate two time units. The lower variance in the number of concurrent operations per state in
the Figure 3.4 graph suggests that it will be more efficient with respect to processor utilization.
Consequently, the main advantage of the limited pipelined configuration is a reduction in the
overhead associated with acknowledge signals.
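
The "concurrent operations per state" statistic amounts to summing the widths of the levels that fire together. A minimal sketch follows; the function name `ops_per_state` and the list encoding of level widths are assumptions for illustration. The widths 2, 1, 1, 1 describe the example graph (OP1 and OP2 on the first level, then OP3, OP4 and OP5).

```python
def ops_per_state(widths, period):
    """Sum the widths of the levels that fire concurrently in each state.

    widths[k] is the number of operators on level k+1, and period is the
    number of alternating states: 2 for full pipelining, 3 when the
    acknowledge arcs link alternate levels.
    """
    return [sum(widths[state::period]) for state in range(period)]


widths = [2, 1, 1, 1]
fully_pipelined = ops_per_state(widths, 2)    # [3, 2]
limited_pipelined = ops_per_state(widths, 3)  # [3, 1, 1]
```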

A second, more involved and more complete example applies this analysis to the Figure 3.3
graph, which appears in its fully pipelined configuration. Note that unlike the previous example, which
translates directly into its fully buffered state under T_d/a, the production of the Figure 3.3 graph
required the application of Algorithm I. The most significant point to note is the need to insert 15
identity operators to attain full pipelining. This represents approximately a 50% increase in the number
of operators in the graph, making the cost of identity operators vs. the benefit of increased throughput
and concurrency an extremely important issue to consider for an actual data flow machine and
application.

Table 3.3 presents a summary of the token flow through the fully pipelined graph (Figure 3.3),
assuming the control token produced by the predicate test involving f is true. For each time unit, the
level of operators firing rather than the particular operators will be specified, where the assignment of
levels to operators is indicated in Figure 3.6. The total number of operators for each level as well as
their breakdown in terms of inserted identity operators as opposed to graph operators (all others) is also
given. Thus, referring to Table 3.3, the second line states that during time unit 1, the first level of
operators fired, all four of which were graph operators. During time unit 2, the second level of
operators fired, one of which was an identity operator and five, graph operators. From the previous
example, we know that successive sets of inputs will step through the graph with alternate levels firing
concurrently to produce an output every second time unit. In terms of the table, the behavior
corresponds to the alternate firing of even and odd levels, where for each of these firing states, the total
number of operations and their makeup are:

ODD 14 operations -- 5 identity and 9 graph
EVEN 16 operations -- 5 identity and 11 graph

The Table 3.3 summary is only valid for two of the three possible f and s states: true-true and true-false.
A separate analysis is necessary for the case where f is false.

Figure 3.6. Fully pipelined data flow graph

As in the previous example, we wish to compare these statistics with an analysis of the
functioning of the graph in limited buffered form. The appropriate graph, shown in Figure 3.7, is
obtained by applying Algorithm II rather than Algorithm I to the T_d/a graph translation of the
expression:

if f = 1 then if s = 1 then x*(y + 1) else x end else x*y end
The most striking contrast between the fully buffered graph (Figure 3.3) and this partially buffered
version is the large reduction in inserted identity operators from 15 to 7. What remains to be explored
is whether the cost of this reduction is an accompanying decrease in performance (see also [27]). To
determine this, we examine several token flow analyses for the Figure 3.7 graph, derived by considering
different successions of input sets. The first example performs the analysis for four sets of inputs which
all follow the same computation path: true-true. The progression of tokens through the graph can be
followed via Table 3.4. The numbers in each box in the table represent the specific operators which fire
during that time unit (given by the horizontal axis), as a result of tokens from the appropriate input set
(given by the vertical axis), where the operators have been numbered as shown in Figure 3.8. Referring
to this graph, Table 3.4 shows that (assuming input set 1 is initially available) during the first time unit
actors 1, 2, 3, and 4 will fire, enabling actors 5 through 10, which will fire during the second time unit.
Figure 3.7. Example of limited pipelining

The second input set becomes present (P) during the second time unit so that operators 1 through 4
may fire in response to this second set during the third time unit along with operators 11 through 14
which fire in response to the first set. In this manner, the progress of the four sets of inputs through the
graph can be followed. The time units during which the corresponding outputs appear have been
noted in Table 3.4 along the top horizontal axis. This information reveals the expected decrease in
throughput which may or may not be acceptable depending on the application.
Figure 3.8. Numbered Figure 3.7 graph to be used in conjunction with Tables 3.4 and 3.5

As mentioned earlier, the probability of a succession of input sets taking the same computation
path is small. Therefore, a second analysis for this partially pipelined graph appears in Table 3.5,
assuming input sets 1 through 4 take the computation paths true-true, true-false, false and true-true
respectively. The table reveals that for this pattern of input sets the limited buffering scheme has no
effect on the throughput, which remains optimal at an output produced every second time unit. This
example confirms the point previously made concerning the significance of a sequence of input sets. A
further analysis of input sets for this data flow graph may reveal that, in fact, it is rarely necessary or
best to transform the graph into fully buffered form.

3.3.3 Additional Considerations

Once an actual data flow machine is available, a study of the tradeoff of throughput for
number of inserted identity operators should provide insight into the direction to take concerning
optimization. Perhaps this information in combination with a particular application will indicate other
optimization possibilities; for instance, concentrating efforts on only the main source of bottleneck
within a graph. For the conditional construct this point appears to be the control arc to the M gate.
Modifications of Algorithm I similar to the one which produced Algorithm II could also be weighed
more realistically as alternative approaches.

A final point to note in the consideration of this buffering optimization strategy is the type of
construct for which it is appropriate. The examples above, which involve conditional constructs and
general compositions of operators, turn out to be fairly representative of the type of graphs for which
this optimization is applicable. In fact, this optimization approach is basically inappropriate for an
iterative process whose function is to modify and recycle a single set of inputs at a time - a process
which does not involve pipelining (however, subgraphs within an iteration may be pipelined). For such
constructs, a different optimization technique must be developed. This alternative strategy, which aims
to minimize the number of acknowledges in a graph by eliminating those which are unnecessary, is the
topic of the next chapter.
CHAPTER FOUR
4.1 Eliminating Unneeded Acknowledge Arcs

This chapter explores an optimization technique for removing unnecessary acknowledge arcs 
in a data flow graph. Though the uniform substitution of data/acknowledge arc pairs for data arcs 
yields a correct implementation of a data flow graph, the acknowledging scheme is costly. The 
overhead of processing acknowledge packets is felt in the routing networks and instruction cells of the 
data flow computer which must respectively handle the resulting increase in traffic and bookkeeping. 
Thus, there is value in questioning whether or not all acknowledge arcs are needed. While it is easy to
find example data flow graphs containing arcs for which an acknowledge is unnecessary, methodical
identification of such instances is extremely difficult due to an often context dependent decision: the
graph configuration and particular construct under consideration are key factors in determining
acknowledge arc removal. In response to this fact, the strategy to eliminate unneeded acknowledge arcs
focuses on individual Val constructs, attempting to identify candidate d/a arc pairs and provide a
corresponding set of rules specifying conditions for removal. Recursive application of the resulting set of rules to a
data flow graph derived from a Val program can then be used to test each candidate arc pair for 
removal of its acknowledge arc. 

The following section considers the possibility of using Petri net theory to govern acknowledge 
arc removal, and subsequently discloses certain data flow graph operational characteristics important to 
the optimization process. Sections 4.3 and 4.4 develop acknowledge arc removal rules for the Val 
conditional and iteration constructs respectively. The latter section includes several example graphs
illustrating applications of the rules formulated for the iteration construct.
4.2 Considerations for Acknowledge Arc Removal

The concern in removing acknowledge arcs from a data flow graph is whether the safe
operation which the arcs ensure is maintained. Though we attempt once again to use Petri net theory as
a guide, this strategy is discouraged not only as a consequence of the chapter 2 discussion, but as a
result of examining T and F gate operators, which display a fundamentally different behavior than that
of transitions. A look at the operation of these gates and their effect on token flow shows the difficulty
in using Petri net theory, and motivates the formulation of new requirements for safe removal of
acknowledge arcs in data flow graphs.

The role of the transition in Petri net theory is analogous to that of the functional data flow 
operator: firing a transition moves tokens on input places to output places of the transition. The T
and F gate function, which allows a computation to proceed in one of two ways, is accomplished by the
Petri net configuration shown in Figure 2.3 and repeated below in Figure 4.1. The essential difference
in the operation of this Petri net is that once one of its T or F transitions fires to place the input token
on a particular path, the transition controlling entrance to the alternate path is no longer enabled. In a
conditional data flow graph, when the gates corresponding to the control input fire, the opposite gates
remain enabled and must fire to absorb their inputs, as is shown in Figure 4.2.

Here the assumption is that the control input to the Figure 4.2 gates was true, allowing a token
to flow through the T gate to enable operator f1. The data flow graph behavior will allow an output to
be produced at the M gate independent of whether or not the input presented to the F gate has been
absorbed. This phenomenon does not occur in the Figure 4.1 Petri net since an input token is switched
down one of the two paths, leaving no extra tokens behind. The significance of this difference becomes
clear when considering the possibility of iterative graph configurations. If we focus on the input arcs to
the F gate, and view the Figure 4.2 graph as the body of an iteration construct which recycles its output
token, ensuring conflict-free operation requires that the input arcs to the F gate be d/a arc pairs.

Figure 4.1. Petri net model of the conditional construct

Figure 4.2. Conditional construct data flow graph

Since the possibility of a similar conflict is absent from the Petri net modelling of the data flow 
graph, the difference in operation of the two renders Petri nets insufficient as a guide for acknowledge 
arc removal in data flow graphs. As a result, the applicability of Petri net theory to the process of 
identifying candidate arc pairs is limited. Instead, the strategy followed examines the various Val 
constructs to develop rules specifying conditions for acknowledge arc removal for each candidate arc 
pair identified in a construct.

An implication of this conditional construct behavior is that the acknowledge arcs of the input
arc pairs to a T or F gate cannot be removed since the presence of a token on an acknowledge arc is the
only way to guarantee the absence of a token on a corresponding data arc: a T or F gate output arc
gives no indication of the state of the gate's input arcs since firing may or may not produce an output 
token. An illustration of additional problems resulting from T and F gate behavior in combination with 
the possibility of nesting conditionals appears in the next section. 
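
The gate behavior underlying this argument can be stated compactly. The sketch below is illustrative only: `fire_gate` is a hypothetical helper, and `None` is used to stand for "no output token", which assumes that data values themselves are never `None`. The point it shows is that firing always consumes both the control and data inputs, while an output token appears only when the control value matches the gate type.

```python
def fire_gate(kind, control, data):
    """Fire a T or F gate on one control token and one data token.

    Both input tokens are always absorbed. The data token is forwarded
    only if the control value selects this gate; otherwise the gate fires
    without producing an output token (represented here by None).
    """
    passes = control if kind == "T" else not control
    return data if passes else None


# With a true control value the T gate forwards its input, while the
# F gate must still fire to absorb its copy of the input.
t_out = fire_gate("T", True, 5)
f_out = fire_gate("F", True, 5)
```

Because `f_out` carries no token, an observer of the F gate's output arc cannot tell whether the gate has already fired, which is why the acknowledge arc on its input is needed.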

4.3 Analysis of the Conditional Construct

To illustrate the analysis needed for finding removable acknowledge arcs we consider the data
flow graph translation of a general conditional construct, shown in Figure 4.3. We begin by focusing on
the slashed arc pair connecting a and the M gate. Recall that the behavior of this arc pair is such that it
cannot accept a second token until the M gate fires to process the previous control token, and send an
acknowledge token to a. This guarantees that a second set of tokens cannot be within the branches of
the conditional until processing of the preceding set has completed. While overcoming the restricting
behavior of this arc pair was the aim of the chapter 3 optimization designed to balance token flow in the
graph, it is an advantage to the process of removing acknowledge arcs, as is seen by following an input
set through the graph. Each input set (processed by either f1 or f2) places a token on the control input
arc of the M gate and a data token on each of the arcs labeled either a and b, or c and d, depending on
whether the control token is true or false. Assuming that f1 and f2 are well formed, an output should
appear on arc g (assuming the control token is true) within finite time, with no possibility of a second
token appearing on arc g, or of any token appearing on arc h until the M gate fires. This event
simultaneously processes the token on arc g and sends an acknowledge token to o, consequent to which
a successive input set may enter a branch of the conditional. The token flow behavior guarantees that
the acknowledge arc of arc pair g can be safely removed, as can that of arc pair h (by an analogous
argument).

Figure 4.3. T[if exp then f1 else f2]
One might be tempted to remove the acknowledge arcs from arc pairs a, b, c, and d under the
assumption that once a set of tokens has entered a branch of the conditional, the tokens must be used
by the appropriate function to produce the corresponding output. However, a consideration of the
Figure 4.4 data flow graph will show that removal of acknowledge arcs for these arc pairs is dependent
on the subgraphs represented by f1 and f2.

Figure 4.4. Unsafe token configuration resulting from removal of c's acknowledge arc

The Figure 4.4 graph is a translation of the following Val program segment:
if f = 1 then if s = 1 then x*(y + 1) else x end else x*y end
Consider a set of tokens flowing through the graph which causes the outer predicate, f = 1, to evaluate
to true, and that of the inner conditional construct, s = 1, to evaluate to false. The tokens on inputs s, x,
and y should appear on arcs a, b, and c, and eventually become the data and control input tokens to the
inner conditional construct's T and F gates. Since the inner conditional's control token is false, the
computation proceeds through its false branch. The important point to note is that continuation of the
computation only requires the tokens which appeared on arcs a and b. The token on arc c need not
propagate through the graph, and may in fact still be on arc c when the outer M gate fires to produce an
output and an acknowledge token, allowing the processing of a successive set of values to begin. Were
a set of inputs to flow through the graph in this manner, removal of c's acknowledge arc would make it
possible to reach the unsafe token configuration shown in Figure 4.4. (The tokens are numbered to
indicate the input set to which they belong.) This behavior is a consequence of T and F gate
functioning, the foundation of the conditional construct structure.

Understanding the analysis is aided by Figure 4.5, which generalizes the Figure 4.4 graph to
expose the subgraph structure. The Figure 4.4 example shows that the necessity of acknowledge arcs
for d/a arc pairs a through e is dependent on whether or not their values are guaranteed to be used in
producing the outputs of the appropriate subgraph (f1 or f2 of Figure 4.5). Examining subgraphs f1
and f2, which respectively represent the inner conditional construct and multiplication operator of
Figure 4.4, reveals that tokens arriving on arcs a, b, d, and e must be used to produce their
corresponding output, while the need of a token arriving on arc c is dependent on the outcome of the
inner decision operator. Therefore, c's acknowledge arc must remain but those of arc pairs a, b, d, and
e can be removed.

Figure 4.5. Generalized version of Figure 4.4 data flow graph

This analysis, specific to the conditional construct, results in designating all input arc pairs to
the f1 or f2 subgraphs subject to rule C1, shown in Figure 4.6, for determining acknowledge arc
removal. While the rule serves to identify and state conditions under which certain arcs within the
conditional construct may not need acknowledges, it gives no method for testing the conditions. This
requires a recursive look at the constructs composing subgraphs f1 and f2, the strategy just used in
analyzing arc pairs a through e in the Figure 4.4 example. It is interesting to note that the analysis can
be applied at the source level by first recognizing that subgraph f1 was a conditional construct, and then
taking the intersection of variables appearing in its then and else clauses. Variables found in the
intersection are guaranteed to be used in producing the output of the construct. Therefore, arcs in the
data flow graph corresponding to these variables should not require acknowledges.
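
The source-level form of this test can be sketched for a single, non-nested conditional. The helper names `variables` and `guaranteed_used` are hypothetical, and extracting identifiers with a regular expression is a simplification that assumes the clauses are plain arithmetic expressions without keywords or nested constructs; a real implementation would work from the parsed Val program.

```python
import re


def variables(expr):
    """Return the set of identifiers appearing in a simple expression."""
    return set(re.findall(r"[A-Za-z_]\w*", expr))


def guaranteed_used(then_clause, else_clause):
    """Variables used whichever branch of the conditional is taken."""
    return variables(then_clause) & variables(else_clause)


# Inner conditional of Figure 4.4: if s = 1 then x*(y + 1) else x end
used = guaranteed_used("x*(y + 1)", "x")
```

For the inner conditional of the Figure 4.4 example the intersection contains only x, so y (arc c) is the variable whose arc pair must keep its acknowledge arc. The predicate variable s is consumed by the test itself and is not covered by this intersection.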

Finally, we look at the only arc in the conditional construct of Figure 4.3 not yet analyzed - the
control (slashed) arc connecting o and the M gate. While the elimination of acknowledge arcs within
our example conditional construct has been largely dependent on the existence of this controlling arc
pair's acknowledge, its presence enables the acknowledge of an inner conditional construct's control arc
to be removed. The argument to justify this is the same as that used to explain the removal of arc g's
acknowledge. Consequently, in the general conditional construct the control arc between a and the M
gate is marked as a candidate for acknowledge arc removal, and is subject to rule C2 shown in Figure 4.6.
This completes the analysis necessary for performing the optimization to remove unneeded
acknowledge arcs within the conditional construct. As a second example, we discuss the iteration
construct, for which this optimization is particularly appropriate.

Figure 4.6. Acknowledge arc removal rules for the conditional construct

C1: The acknowledge arc of an input arc pair to subgraph f1 or f2 may be removed if
any token arriving on the arc must be used in producing the output of the
subgraph.

C2: The acknowledge arc of the control arc connecting o and the M gate can be
removed if the acknowledge arcs of the output arc pairs of the M gate have been
removed.

4.4 Analysis of the Iteration Construct 

4.4.1 Acknowledge Arc Removal

The fact that the optimization presented in chapter 3 is specific to acyclic segments of a data
flow graph emphasizes the significance of analyzing the iteration construct for unneeded acknowledge
arcs. Figure 4.7 shows the data flow graph translation of the Val iteration expression:

for idlist = exp do iterbody end
The function of this construct is to evaluate exp and then perform iterbody, which outputs an iter?
control value and a set of data values on either its I (iteration) or R (return) output arcs, depending
respectively on whether the iter? output value is true or false. Successive evaluations of iterbody are
made until a false iter? value is produced, at which time evaluation of the construct with a new set of
inputs can begin.

The function of the iter? arc is to provide the control value to the group of M gates which
present successive sets of inputs to the iteration body. The arc is initialized with a false control value to
ensure proper selection of the first set of data values. Assuming that the iter? value is dependent on at
least some of the M gate inputs, a number of them must fire before a second iter? value is produced.
This necessarily implies the firing of copy operator "L" in Figure 4.7, to present the M gates with the iter?
control inputs needed to enable them - consequently ensuring that the iter? output arc of iterbody must
be empty for a successive iter? value to be produced. As a result, the acknowledge arc of this arc pair
(between iterbody and L) can be removed.

Figure 4.7. Acknowledge arc removal rules for the iteration construct

T1: The acknowledge arc for an arc pair between operator L and the sequence of M
gates can be removed if its data value must be used in producing the iter? value.

T2: The acknowledge arc of an I (iteration) arc pair can be removed if either

(1) The iteration body cannot emit a value on that output arc until it has
absorbed the corresponding input value on the corresponding input arc.

(2) The iter? value depends on the corresponding input arc.

T3: The acknowledge arc of a v_j arc pair can be removed if the arc pair is not input to
a T or F gate, and the iter? output value of iterbody depends on the v_j arc value.

No such guarantee can be made for the arcs between copy operator L and the M gates, since
the iter? value need not be a function of every M gate input. This implies the possibility of producing a
second iter? value before every instance of the previous iter? value appearing on the arc pairs between L
and the M gates has been absorbed. Should L fire, unconditional removal of the acknowledge arcs of
these arc pairs could cause a conflict. Consequently, acknowledge arcs of these arc pairs are marked as
conditionally removable subject to rule T1, specified below Figure 4.7: M gates whose data value
inputs are used in producing the iter? control value must fire (absorbing the current iter? value, their
control input) before a successive iter? value is produced, and consequently, need no acknowledge arcs.
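
A first approximation of the dependence test in rule T1 is a reachability search over the graph's data dependencies. The sketch below is an illustration under stated assumptions: the names `depends_on` and `removable_control_acks`, the dictionary encoding, and the node label "iter?" are hypothetical, and every dependency edge is treated as unconditional. Since rule T1 requires that the value must be used, paths passing through T or F gates would need the additional analysis described for the conditional construct.

```python
def depends_on(deps, target):
    """Return every node that target transitively reads from.

    deps maps a node to the nodes whose outputs it consumes.
    """
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for source in deps.get(node, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


def removable_control_acks(deps, m_gates, iter_node="iter?"):
    """M gates whose outputs feed the computation of the iter? value."""
    used = depends_on(deps, iter_node)
    return {gate for gate in m_gates if gate in used}


# Hypothetical loop body: iter? is computed from M1's output only.
deps = {"iter?": ["cmp"], "cmp": ["M1"], "add": ["M2"]}
candidates = removable_control_acks(deps, ["M1", "M2"])
```

In this example only M1 qualifies, so under the stated assumptions the arc pair between L and M1 could drop its acknowledge arc while the one between L and M2 keeps it.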

Examining the form of the iteration construct's iterbody is a necessary preliminary to
determining acknowledge arc removal for the remaining arc pairs in the iterative graph. Since the
function of the construct is to iterate or return a set of values based on some boolean function, iterbody
must contain a conditional. The BNF specification of Val confirms this via the production:

iterbody ::= if exp then iterbody1 else iterbody2 end

Figure 4.8 shows the data flow graph translation of this conditional iteration body. Graph inputs are
respectively presented to the subgraph representing either iterbody1 or iterbody2 via T or F gates, as a
result of evaluating exp. The selected subgraph will produce a set of outputs at either its I (iteration) or
R (return) output ports according to its iter? output value: true for I outputs; false for R outputs. The
iter? output values of the iteration body subgraphs, along with the output of the predicate subgraph,
exp, are the inputs to the IC gate which controls the graph output ports. The IC gate has three outputs:
a graph iter?, and an I control value and R control value which provide control inputs to two sets of M
gates respectively merging the I and R data outputs of the iteration body subgraphs to produce graph
outputs. A more detailed specification of the IC gate is given in Table 4.1. Functioning of the
conditional iteration body is seen through several examples presented in section 4.4.2.

Figure 4.8. T[if exp then iterbody1 else iterbody2 end]

Table 4.1. Functioning of the IC gate.

By replacing iterbody in the Figure 4.7 graph of the iteration construct with the Figure 4.8
conditional iteration body to produce Figure 4.9, the I output arcs of the iteration construct can be
analyzed for acknowledge arc removal.

Figure 4.9. Iterative data flow graph containing iterbody subgraph of Figure 4.8

Recall that a set of output values should appear on the I arcs for each true iter? value produced.
The acknowledge arc of a particular I output arc may be removed if either of two conditions is satisfied.
The first is the case in which production of the output value is dependent on the corresponding input
value; appearance of a new value implies absorption of the previous value. At first glance this would
seem to occur always. In fact, it is possible to produce a second output on some I arc without using the
previous value, as is seen in the example in section 4.4.2. The second condition under which an I
acknowledge arc can be removed is dependence of the iter? value on the corresponding I input value.
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To understand this wc look at *e IC gate in Figui* 4.9, one of whose outptrt arcs is iter?. 
Firing the IC gate will produce values on two of its three output arcs; the iter? arc and cither the 
iteration or return control arcs which respectively provide control input values for M gates connected to 
the graph I and R output ports. Until the IC gate fires, these M gates will not be enabled. A set of 
values appearing on the graph I output ports therefore' requires the prior IC gate firing to produce the 
M gate control values, as well as an iter? value. It is clear that if this iter? value is dependent on a 
particular I arc input value, that I arc must be empty for it to receive a successive iteration value. 
Consequently, acknowledge arcs of I arc pairs satisfying this iter? dependence arc not needed. The two 
conditions under which the acknowledge arc of an I arc pair can be removed are summarized in rule 
T2, of Figure 4.7. 
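
The IC gate behavior described above can be sketched in Python. Table 4.1 is not reproduced in this
section, so the selection logic below is an assumption inferred from the prose: the predicate value picks
which branch's iter? value is forwarded, and exactly one of the two control outputs receives a value.
The names ic_gate and ICOutput are hypothetical and serve only as illustration.

```python
from typing import NamedTuple, Optional


class ICOutput(NamedTuple):
    """Values placed on the three IC gate output arcs (None means no token)."""
    iter_q: bool                # graph iter? value
    i_control: Optional[bool]   # control for the M gates merging I outputs
    r_control: Optional[bool]   # control for the M gates merging R outputs


def ic_gate(pred: bool,
            iter_q_then: Optional[bool],
            iter_q_else: Optional[bool]) -> ICOutput:
    """Hypothetical IC gate firing rule (assumed, not copied from Table 4.1).

    pred is the predicate output and selects the branch that actually ran.
    That branch's iter? value becomes the graph iter? value. The control
    token carries pred so the M gates know which branch's data to pass on.
    """
    selected = iter_q_then if pred else iter_q_else
    if selected is None:
        raise ValueError("selected branch has not produced an iter? value")
    if selected:
        # Iterate: enable only the M gates on the I output ports.
        return ICOutput(iter_q=True, i_control=pred, r_control=None)
    # Return: enable only the M gates on the R output ports.
    return ICOutput(iter_q=False, i_control=None, r_control=pred)
```

In this sketch every firing places a value on exactly two of the three output arcs, matching the
description in the text.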

To complete analysis of the iteration construct we discuss the input arc pairs to the iteration
body, labelled vj in Figure 4.7. Testing for acknowledge arc removal must be done individually for
each vj according to the following guidelines: If the arc pair is input to a T or F gate, the acknowledge
arc must remain. This follows from the discussion of T and F gate behavior. If the arc pair is input to a
functional operator or M gate, the acknowledge arc can be removed if the iter? output of the iterbody is
dependent on the vj arc value. The vj arc pairs are outputs of a set of M gates controlled by the graph
iter? value. In order to remove the acknowledge arc of a particular vj arc pair, it is not sufficient that
the vj value be needed in computing a successive iterative value in response to a true iter? output. The
vj value must also have been used before a new input value resulting from a false iter? value appears.
This is ensured if iter? depends on the vj value. Rule T3 shown in Figure 4.7 states the acknowledge arc
removal rule for the vj arc pairs.
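
The guidelines for the vj arc pairs can be written as a small decision function. This is a sketch of the
rule as stated in the paragraph above, not the wording of rule T3 in Figure 4.7; the function name and
the string labels for consumer kinds are assumptions made for illustration.

```python
def can_remove_vj_ack(consumer: str, iter_q_depends_on_vj: bool) -> bool:
    """Decide whether the acknowledge arc of one vj arc pair may be removed.

    consumer: kind of node the arc pair feeds: "T", "F", "operator", or "M"
              (hypothetical labels).
    iter_q_depends_on_vj: True if the iterbody's iter? output depends on
              the value carried by this vj arc.
    """
    if consumer in ("T", "F"):
        # Arc pairs feeding T or F gates always keep their acknowledge arc.
        return False
    if consumer in ("operator", "M"):
        # Removal is safe only when iter? cannot be produced without
        # absorbing the vj value first.
        return iter_q_depends_on_vj
    raise ValueError(f"unknown consumer kind: {consumer!r}")
```

The test is applied to each vj arc pair separately, so two arc pairs leaving the same M gate may receive
different answers.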

4.4.2 Acknowledge Arc Removal - Iterative Programs

To apply the acknowledge arc removal rules developed in the previous section, we begin with
the simple but familiar factorial algorithm expressed as the following Val program:

for i, y = 1, 1 do
    if i < n then iter i + 1, y * i else y end
end

The data flow graph representation of this program is shown in Figure 4.10. The graph is composed of
an iteration construct whose iterbody is a simplified form of the conditional iteration expression shown
in Figure 4.8. The simplification occurs since only the then clause of the conditional iteration body will
actually iterate values. Though both branches have the ability to iterate and return values, the tail
recursive structure of the algorithm causes values to be iterated through one branch and returned
through the other.
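
For readers less familiar with Val, the iteration can be mirrored by a sequential Python loop. This is
only a reading aid: it assumes that iter rebinds i and y simultaneously from their previous values, and it
says nothing about the concurrency available in the data flow graph. The function name is hypothetical.

```python
def val_factorial_loop(n: int) -> int:
    """Sequential rendering of the Val iteration as transcribed in the text."""
    i, y = 1, 1                  # for i, y = 1, 1 do
    while i < n:                 # predicate of the conditional iteration body
        # then clause: iterate. The tuple assignment evaluates both
        # right-hand sides with the old i and y before rebinding.
        i, y = i + 1, y * i
    return y                     # else clause: return y
```

With the program as transcribed, the loop stops as soon as i reaches n, so the returned product covers
the integers 1 through n - 1.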

If a set of rules existed for each Val construct, determining which acknowledge arcs to remove
for the factorial data flow graph would begin with analysis of the inner conditional iteration body.
However, since we have only developed rules for the conditional and iteration constructs, we must
leave the conditional iteration body as is, and proceed to the surrounding iteration construct.

Clearly, the acknowledge arc between the IC gate and operator L can be removed. Rule T1
governs the arc pairs between L and the M gates. The i and n data values must be used in producing
the iter? control value; therefore, only the acknowledge arcs of the arc pairs between L and the M gates
controlling the i and n data values may be removed. The I1, I2, and I3 (iteration) arc pairs satisfy the first
condition of rule T2; a successive value cannot be produced on the I output arc until the corresponding
input value on the corresponding input arc has been absorbed. Thus, none of these needs an
acknowledge arc. Finally, we examine the vj arc pairs, which in the Figure 4.10 graph represent all six
arc pairs emanating from the three M gates controlling the i, y and n data values. According to rule T3,
only the two arc pairs input to the predicate of the conditional iteration body can have their
acknowledge arcs removed. The other four are input to T and F gates, making their acknowledge arcs
essential. The results of this analysis are shown in Figure 4.11 where each arc requiring an acknowledge
arc has been marked with a double bar, ||; those not marked are assumed to be single data arcs.

Figure 4.10. Data flow graph of the factorial algorithm

While the factorial data flow graph shown in Figure 4.10 is produced by the T algorithm, the
simplified form of the conditional iteration body is significant in that the M gates which merge iteration
and return values of the construct, though present, serve no function. The temptation is to optimize the
graph by removing these M gates as well as the IC gate I and R control outputs. Though possible, rule
T2 must be reevaluated as a direct consequence of this action since the analysis used to formulate rule
T2 relies on the standard form of the conditional iteration body shown in Figure 4.8. Specifically, the
reasoning behind case (2) of rule T2 is dependent on the presence of the I and R M gates. We state rule
T2 and proceed to reexamine each of its cases.

Figure 4.11. Optimized factorial data flow graph

T2: The acknowledge arc of an I arc pair can be removed if either: 

(1) The iteration body cannot emit a value on that output arc 
until it has absorbed the corresponding input value on the 
corresponding input arc. 

(2) The iter? value depends on the corresponding input arc.



Condition (1) of this rule still applies, since it describes the situation in which each successive
iteration value is a function of its previous value. Clearly, only one value can appear on an arc which
satisfies this condition at any time. Removing the M gates does not affect this case. To reevaluate case
(2) of rule T2, we focus on the data flow graph shown in Figure 4.12, the representation of the Val
program:

for i, y = 1, 1 do
    if i < n then iter y + 1, i + 2 else y end
end

This graph, similar in structure to the factorial graph, displays the same M gate phenomenon, but is
significant in its reassignment of iteration variables. Each of these two variables is a function of the
other: Iteration variable i is a function of y, and iteration variable y is a function of i.

Figure 4.12. Example data flow program

Iteration arcs of the factorial data flow graph satisfied case (1) of rule T2, dependence of a
successive value on its previous value, allowing their acknowledge arcs to be removed. Case (1) does
not apply to the I1 and I2 arc pairs in the graph in Figure 4.12 due to the "crossover" reassignment of
iteration variables. However, their acknowledge arcs can be removed since case (2) of rule T2 is
satisfied: Production of the iter? value depends on both i and y. Variable i is needed to compute the IC
gate control input and variable y generates the gate's true data input.
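
The two cases of rule T2 can be checked mechanically once the dependences are written down. The
sketch below is an illustration under simplifying assumptions: case (1) is approximated by asking
whether the next value of a variable reads that same variable, and the dependence sets are entered by
hand from the discussion in the text rather than derived from a graph. The function name and the
dictionary encoding are hypothetical.

```python
from typing import Dict, Set


def t2_verdicts(next_value_reads: Dict[str, Set[str]],
                iter_q_reads: Set[str]) -> Dict[str, str]:
    """Classify each iteration variable's I arc pair under rule T2.

    next_value_reads[v]: variables read to compute the next value of v.
    iter_q_reads: variables on which the iter? value depends.
    Returns "case 1" or "case 2" when the acknowledge arc may be removed,
    and "keep" when it must stay.
    """
    verdicts = {}
    for var, reads in next_value_reads.items():
        if var in reads:
            verdicts[var] = "case 1"   # new value needs the previous value
        elif var in iter_q_reads:
            verdicts[var] = "case 2"   # iter? needs the previous value
        else:
            verdicts[var] = "keep"
    return verdicts


# Factorial: iter i + 1, y * i; iter? depends on i and n.
factorial = t2_verdicts({"i": {"i"}, "y": {"y", "i"}, "n": {"n"}}, {"i", "n"})

# Crossover example: iter y + 1, i + 2; iter? depends on i and y (and n).
crossover = t2_verdicts({"i": {"y"}, "y": {"i"}, "n": {"n"}}, {"i", "y", "n"})
```

Under these inputs every factorial arc pair falls under case (1), while the i and y arc pairs of the
crossover example qualify only through case (2).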

The structure of the Figure 4.12 data flow graph enables us to examine whether case (2) of rule
T2 correctly determines acknowledge arc removal if the graph is optimized by removing its I and R M
gates and IC output control arcs (portion of the graph shown in the dashed box). Consider the state of
the graph shown in Figure 4.13, the optimized version of the Figure 4.12 graph.

It is now possible for a sequence of operator firings to place a successive value on I2, resulting
in the unsafe state shown in Figure 4.14. Even though the IC gate is dependent on the y value, the
production of successive iteration values is no longer dependent on the prior firing of the IC gate.
Thus, the i value can propagate through the graph to produce a successive y value before the previous y
value has been absorbed. We see that as a result of optimizing the standard graph form, the case (2)
condition is no longer adequate for ensuring safe removal of iteration acknowledge arcs.

One approach to this problem is to specify this type of graph optimization as illegal. Such a
restriction favors the removal of iteration acknowledge arcs over the removal of unnecessary operators.
At the same time, it enables uniform application of the present acknowledge arc removal rule. A
second approach involves redefining rule T2 for optimized graphs whose M gates have been
eliminated. Removal of I acknowledge arcs becomes dependent on the predicate value rather than the
iter? value. The functioning of the graph dictates that data used in producing I or R values must come
through the T or F gates controlled by the graph predicate. This ensures that M gates controlling
variables used in computing the predicate must fire before new iteration values can be produced. The
modified version of rule T2, case (2) reflects this analysis by specifying that an iteration acknowledge
arc may be removed if its corresponding input arc must be used in producing the predicate value.

Figure 4.13. Modified data flow program from Figure 4.12



T2: The acknowledge arc of an I arc pair can be removed if either:

(1) The iteration body cannot emit a value on that output arc
until it has absorbed the corresponding input value on the
corresponding input arc.

(2) The predicate output value depends on the corresponding
input arc.

Figure 4.14. Unsafe token configuration for Figure 4.13

Using this rule, the acknowledge arc of iteration arc pair I2 can not be removed since computation of
the predicate value does not involve the variable controlled by its corresponding input arc.
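
The effect of switching case (2) from iter? dependence to predicate dependence can be seen by
comparing the two variants on the crossover program. The sets below are transcribed by hand from the
program text (the predicate i < n reads i and n) and from the earlier observation that iter? depends on
both i and y; the names used are hypothetical, and the membership test for case (1) is a simplification.

```python
# Crossover program: for i, y = 1, 1 do if i < n then iter y + 1, i + 2 else y end end
ITER_VARS = ("i", "y", "n")
NEXT_VALUE_READS = {"i": {"y"}, "y": {"i"}, "n": {"n"}}
ITER_Q_READS = {"i", "y", "n"}    # standard graph: case (2) tests iter?
PREDICATE_READS = {"i", "n"}      # optimized graph: case (2) tests the predicate


def removable(control_reads):
    """Variables whose I acknowledge arc may be removed under rule T2."""
    return {
        v for v in ITER_VARS
        if v in NEXT_VALUE_READS[v]   # case (1)
        or v in control_reads         # case (2), either variant
    }


standard = removable(ITER_Q_READS)
optimized = removable(PREDICATE_READS)
must_keep_after_optimizing = standard - optimized
```

Only y, the variable carried by I2, drops out of the removable set, which agrees with the conclusion
stated above.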

This analysis of the factorial algorithm emphasizes the options and problems which quickly
surface in considering rather basic examples. The acknowledge arc removal rules, while adequate for
graph configurations derived by straightforwardly applying the T algorithm, could require significant
expansion to be compatibly used with other optimizations. A study of more complex graphs or of
those requiring this optimization in conjunction with other optimizations would be useful in
determining the general applicability of these rules, and is designated as an area of interest for future
research.

CHAPTER FIVE
5.1 Summary 

The aim of this thesis has been to address problems which arise in translating a high level
language for a machine architecture designed for parallel processing. While the high level language is
nearly indistinguishable from source languages for standard sequential processors, the data driven
execution of its instructions requires a radically different form of translation. This study of data flow
translation uses the high level language Val and the Dennis-Misunas architecture. While standard
methods of data flow processing do not yet exist, the model used reflects the type of translation issues to
be tackled in the realm of data flow. The problems unveiled and solutions proposed are illustrated
using data flow graphs, which result from applying the T translation algorithm to Val programs.
Though these data flow graphs closely correspond to the machine language representation of Val
programs, their level of abstraction and explicit representation of data dependencies make them a
generally accepted model of data flow.

Chapter 2 focuses on the firing behavior of data flow graph operators which must ensure a
maximum capacity of one value per arc as dictated by the Dennis-Misunas architecture. While
restrictions of other data flow architectures may be less severe, the need to place some finite limit on arc
capacity is common to most. The transformation of arcs within data flow graphs to data/acknowledge
arc pairs is introduced as a means of implementing the desired operator behavior. A formal argument
establishes that the safe operation resulting from the transformation is guaranteed, and that the liveness
and functionality of the graph are not altered. The use of data/acknowledge arc pairs does however have
a profound effect on operator firing sequences within a given graph, and therefore on its throughput.
The remainder of the thesis explores the consequences of incorporating d/a arc pairs and suggests
methods of modifying the transformation algorithm to improve graph performance.

Though safe operation is achieved by preventing any given operator from firing until
appropriate acknowledges are received, the delayed firing of an operator may cause a subsequent and
unnecessary delay to operators dependent on its output. This phenomenon is the subject of chapter 3.
The algorithm developed in this chapter eliminates potential bottlenecks within a graph by buffering
arcs with identity operators so that all paths through the graph are an equal length. Analyses of
performance show that this approach maximizes throughput, but at a potentially high cost in terms of
identity operations. While performance statistics indicate that the latter strategy is promising, the
choice of an optimum buffering scheme is complicated by the number of interacting factors.
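
The idea of equalizing path lengths with identity operators can be illustrated with a short sketch. It is
not the chapter 3 algorithm: it assumes an acyclic graph, assigns each operator a level equal to its
longest distance from any input, and pads each arc up to the level difference it spans. The result
equalizes all paths but is not guaranteed to use the fewest identity operators. All names are
hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Tuple

Arc = Tuple[str, str]


def identity_buffers(arcs: Iterable[Arc]) -> Dict[Arc, int]:
    """Number of identity operators to insert on each arc of an acyclic graph."""
    arcs = list(arcs)
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v in arcs:
        successors[u].append(v)
        indegree[v] += 1
        nodes.update((u, v))

    # Longest-path level of every node, computed in topological order.
    level = {node: 0 for node in nodes}
    ready = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            level[v] = max(level[v], level[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("graph contains a cycle")

    # An arc spanning k levels needs k - 1 identity operators.
    return {(u, v): level[v] - level[u] - 1 for u, v in arcs}


# a feeds c both directly and through b, so the direct arc needs one buffer.
example = identity_buffers([("a", "b"), ("b", "c"), ("a", "c")])
```

After padding, every path between two operators crosses the same number of arcs, which is the
property the text associates with maximum throughput.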

A second approach for optimizing a transformed data flow graph, which aims to decrease
overhead by eliminating unneeded acknowledge arcs, is discussed in chapter 4. By identifying
situations in which particular arcs do not depend on an acknowledgement to prevent multiple token
occurrences, the number of acknowledge arcs can be minimized. This is accomplished by analyzing the
data flow graph implementation of each Val construct to find arc pairs that may be subject to
acknowledge arc removal, and specifying rules which enable these situations to be recognized. The
chapter concludes with several examples illustrating this optimization. While the techniques of
balancing token flow and removing unnecessary acknowledge arcs have been developed independently,
the optimum configuration for any given data flow graph is reached by application of both
optimizations. The absence of specific information about hardware
prevents the development of an algorithm combining the two at this time; however, an attempt is made
to identify the major factors contributing to the choice of optimizations. These issues developed in
chapters 3 and 4 should prove applicable to translation and optimization problems arising in other data
flow models.

5.2 Directions for Future Research

Three areas of research are natural extensions of the work presented. The first focuses on
further development of the chapter 4 optimization. The work presented analyzed the Val conditional
and iteration constructs to determine the circumstances under which certain arc pairs could safely
function without an acknowledge arc. A more extensive study of data flow graphs containing these
constructs would be useful in determining the completeness of the rules presented. Certain graph
configurations may reveal additional cases to test for in removing acknowledge arcs, thus leading to an
extension of the proposed rules. A more straightforward task involves application of the chapter 4
analysis to the remaining Val constructs. This work is required for the development of a recursive
algorithm which could perform acknowledge arc removal for the data flow graph representation of a
program.

A second avenue of research centers on performance evaluation of data flow graphs. As data 
flow computer prototypes become available, the type of performance analysis shown in chapter 3 
should produce more accurate data. Statistical studies can be made of token flow patterns for various 
graph configurations, and corresponding optimization schemes. Information gathered should 
determine when or whether the benefits of an optimized graph outweigh the cost incurred. A study of 
different configurations of a single data flow graph should provide valuable data on optimization 
tradeoffs. This would contribute invaluable information toward formulating an algorithm integrating 
the optimizations of chapters 3 and 4. 

Finally, the research can be extended to include more traditional optimization techniques.
This would initially require a determination of which of these optimization strategies are applicable
and adaptable to data flow. While redefining optimizations such as strength reduction seems possible
and fairly straightforward, the adaptation of other traditional optimizations to a parallel processing
context may require a different set of considerations. A data flow version of these optimizations could
depend on the development of certain tools, such as a categorization of equivalent graph
configurations. A comprehensive examination of the application and meaning of such traditional
optimizations in data flow remains. The potential in following this route, and of further developing
optimizations particular to data flow computation, is just beginning to be tapped. The extensive history
of sequential programming optimization techniques will no doubt have its counterpart in the world of
data flow.